4th Sep 2015

Double Trouble: Duplicate Content

I am a copywriter. I spend my days researching and writing, then researching some more. My bills are paid by this cycle, and I have become a total devotee and slave to it. With some shock and amazement, I realised recently that I had been doing this same thing day in and day out … for over two decades! That’s a lot of words …

In all those years of writing, simply through habits of style and composition, I must have produced duplicate content, whether in word or in theme, and whether intentionally or not. Since the rise of digital marketing, the phrase ‘duplicate content’ has become a dirty word, suggesting skulduggery and nefarious intent, but is that true? Does Google really put so much faith in virgin content, or is it a fairy tale used to scare poor interns with no real experience of wordsmithing?

To find out what this evil called ‘duplicate content’ is really about, I set out on a quest to identify and challenge it.

My first stop was http://www.copyscape.com/

Copyscape is well known in the content marketing industry, and mostly used as yet another stick to beat the poor intern with. The program was written to verify that an article is original by checking it against other content on the internet and seeing if it is duplicate.

For test content, I copied pages from random websites to see what results I got.

My first page, about horse abscesses (with pictures), was chosen because I thought, ‘Who would want to copy this?’, and was duly run through Copyscape. It failed: plagiarism was found. So I moved on to the next page, one written by a friend that I knew was original, and it too failed. I tried more pages from all kinds of documents, including my own thesis – documents I knew were not on the net – and guess what: they all failed. So I tried another online plagiarism checker, www.grammarly.com

They all failed again. Every single one of them. I used translations from foreign languages (what would be the point of plagiarising those? Have you ever used Google Translate?), children’s fairy tales, car sales publications and high school transcripts from my children’s school. Not one of them was deemed original.

So on to another, just in case… http://www.plagspotter.com/ This one had to be better. It bragged that 41% of the content it had checked that day was plagiarised. That should mean the other 59% was original and copy-free – nope! According to that site, all of the content was plagiarised; it’s just that some of it was only 41% plagiarised and some 59%.

Did this really mean there was no original content left in the world? Have all the words worth writing been written? I am pretty sure the hawk-eyed professors at Oxford University who read my thesis would have noticed the ‘significant plagiarism’ the websites professed it contained – if it contained any. From Greek philosophy to labels on jars, my tutors read it all, and retained it. I knew it was original content, and more importantly, so did they. I have a certificate to prove it!

So why the disparity between the internet and real life? Like so many aspects of modern life, it means that once again we have put the future into a machine’s hands, and it’s all been screwed up. Just look at the ‘plagiarism’ the websites noted:

‘Unidentified plagiarism’ on several pages … if they can’t identify it, how do they know it’s plagiarism?

‘To be, or not to be…’ phrase found on 4,357,528,521 other pages. Shocker.

‘Plagiarised colloquialism’. Correct me if I’m wrong (I only have a degree in English), but isn’t that an oxymoron?

‘Repeated use of the phrase “Once upon a time”.’ My apologies to Hans Christian Andersen.

‘“The Tenant of Wildfell Hall by Anne Brontë” found on 57,698 other sources’. That would be my English thesis. As it was an in-depth critique of the social issues surrounding the Victorian patriarchal hierarchy in The Tenant of Wildfell Hall by Anne Brontë, it was hardly a phrase I could leave out.

What’s the problem?

The problem with these checkers is that they can only detect word-for-word duplicates. It may be that each of the sites I checked was set to ‘so super-sensitive that a sentence only has to contain more than two e’s to be flagged as fraud’, but it is more likely that this is simply the limitation of the service. If you used local vernacular or an oft-referenced proverb, the machine considered it plagiarised.
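To make that limitation concrete, here is a toy sketch – my own illustration, not any checker’s actual code – of word-for-word matching: slide a fixed window of words over the text and look each chunk up in an index of phrases already seen elsewhere. A stock phrase like ‘Once upon a time’ trips the alarm even when nothing was copied.

```python
# Toy word-for-word duplicate detector (illustrative only; real checkers
# are more sophisticated, but the exact-match core is the same idea).

def shingles(text, size=4):
    """Break text into overlapping word-for-word chunks ('shingles')."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

# Pretend index of phrases already published somewhere on the web.
seen_on_the_web = shingles("once upon a time there lived a poor intern")

draft = "Once upon a time a copywriter tested five plagiarism checkers"
matches = shingles(draft) & seen_on_the_web
print(matches)  # {'once upon a time'} - flagged, though nothing was copied
```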

There are some cases where it is perfectly honest to have word-for-word copies – quotes, for example. Yet it would appear that even if you use quotation marks, cite your sources and provide a link identifying the passage as a direct quotation, it will still be flagged as duplicate. Proverbs, sayings and even locally known places and phrases fall into the same category, as do forums, where users copy and paste another’s post to reply to it. Not to mention syndication – a bona fide marketing tool for SEO rankings. It is becoming more and more clear that not all duplicate content is equal.

Should we change the way we write to accommodate these limitations?

No, never. The short answer is ‘no’; the long answer is ‘Noooooooooooo’. Why would we want to write to please machines? It’s people we are writing for! It’s people we want to engage. A machine has no part in that equation. If two people write about the same subject, and their copy just happens to be similar in essence and style, does that mean one is plagiarised, or that one is worse than the other? What about sequels? A second book in a series could be flagged as plagiarised because it has the same style and the same vocabulary – should we disallow it?

Now you can see how ridiculous living our lives by the predefined parameters of a machine can be, it’s time to work out why ‘duplicate content’ has become such a sin.

The Rise of SEO

To a large degree, Google has to hold its hand up as responsible for this one. Google has a policy on duplicate content, and it goes like this:

“Duplicate content

Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Mostly, this is not deceptive in origin. Examples of non-malicious duplicate content could include:

Discussion forums that can generate both regular and stripped-down pages targeted at mobile devices

Store items shown or linked via multiple distinct URLs

Printer-only versions of web pages

If your site contains multiple pages with largely identical content, there are a number of ways you can indicate your preferred URL to Google. (This is called “canonicalization”.) More information about canonicalization.

However, in some cases, content is deliberately duplicated across domains in an attempt to manipulate search engine rankings or win more traffic. Deceptive practices like this can result in a poor user experience, when a visitor sees substantially the same content repeated within a set of search results.”
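As an aside, the ‘canonicalization’ Google mentions is less exotic than it sounds: you simply tell crawlers which URL is the preferred version of the content. Below is a minimal sketch of one documented method, the HTTP Link header with rel="canonical" – the Flask app and the example.com URLs are my own invention for illustration, and the same signal can also be sent with a link tag in the page’s head.

```python
# Minimal canonicalization sketch (assumptions: Flask is installed, and the
# routes/URLs here are invented purely for illustration).
from flask import Flask, make_response

app = Flask(__name__)

@app.route("/widgets/print")
def printer_friendly_widgets():
    # Same content as /widgets, rendered as a printer-only page.
    resp = make_response("...printer-friendly rendering of the widgets page...")
    # Point crawlers at the preferred (canonical) URL so the duplicate
    # is consolidated rather than treated as suspicious.
    resp.headers["Link"] = '<https://example.com/widgets>; rel="canonical"'
    return resp

if __name__ == "__main__":
    app.run()
```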

This policy covers everything – and nothing – at the same time. ‘Multiple pages’? ‘Substantive blocks’? ‘Appreciably similar’? They convey the message that building a website out of other people’s material is not the done thing, but what kind of metrics are we looking at? Is one paragraph enough to send you to blacklist hell, or one hundred? How many pages make a multiple? Two, or two thousand? There is no indication of where the parameters lie.

Google also has to admit it has created a market for SEO-enriched text, so copywriters are being pushed to write more content – and faster. Now that we live in a worldwide marketplace, the price of written content is dropping, so some writers cut corners to meet demand. Cutting and pasting is one way to speed up production and keep the engagement up.

Using sweet experience

Andy Crestodina of Orbit Media wrote an excellent article about the whole concept of duplicate content. He called it ‘3 myths about duplicate content’.

He starts the article by challenging readers to give him examples they knew of where duplicate content had been penalised by Google. Out of the many comments, no one could. Not even half a sniff of a penalty. The only example came from Crestodina himself. He wrote:

“I have never seen any evidence that non-original content hurts a site’s ranking, except for one truly extreme case. Here’s what happened:

The day a new website went live, a very lazy PR firm copied the home page text and pasted it into a press release. They put it out on the wire services, immediately creating hundreds of versions of the home page content all over the web. Alarms went off at Google and the domain was manually blacklisted by a cranky Googler.

It was ugly. Since we were the web development company, we got blamed. We filed a reconsideration request and eventually the domain was re-indexed.

So what was the problem?

Volume: There were hundreds of instances of the same text

Timing: All the content appeared at the same time

Context: It was the homepage copy on a brand new domain

It’s easy to imagine how this got flagged as spam.

But this isn’t what people are talking about when they invoke the phrase “duplicate content.” They’re usually talking about 1,000 words on one page of a well-established site. It takes more than this to make red lights blink at Google.

Many sites, including some of the most popular blogs on the internet, frequently repost articles that first appeared somewhere else. They don’t expect this content to rank, but they also know it won’t hurt the credibility of their domain.”

And he’s right. In all the research I conducted into Google’s parameters for duplicate content, I never found anything concrete – just a lot of conjecture with no real-life examples. Extreme cases of ‘duplicate content’ do get noticed, but until you press the ‘extreme’ button, it would seem Google doesn’t care.

Plagiarism vs Duplicate Content

It always makes me smile that people appear to be more afraid of a Google penalty than of the law.

Plagiarism and duplicate content are not the same thing.

One answer on Quora states:

“Duplicate content is content you copy from another site and paste on your own. Accrediting it to the source or not does not change the fact that it is duplicate content. While google does not necessarily penalize for duplicate content that is accredited to the source, it probably would not rank higher since google would be thinking along the line of there is someone else who wrote this article so that person should rank higher.

Plagiarism is more like stealing/using someone’s publication and stuff and claiming it as your own original work.”

If you have duplicate content, Google may blacklist you. If you have plagiarised content, you can get sued for copyright infringement. Which would you prefer?

 

The real crux of the matter is this: Google cannot police duplicate content with bots alone. It takes a human to look over content and see whether it’s plagiarised, duplicated or just plain badly written.

Am I condoning either plagiarism or duplicate content? No, not at all. Plagiarism is illegal, and duplicate content is lazy writing that makes for unengaging copy. Neither is attractive.

Andy Crestodina puts everything into perspective at the end of his article, where he writes:

“Calm down, People.

In my view, we’re living through a massive overreaction. For some, it’s a near panic. So, let’s take a deep breath and consider the following…

Googlebot visits most sites every day. If it finds a copied version of something a week later on another site, it knows where the original appeared. Googlebot doesn’t get angry and penalize. It moves on. That’s pretty much all you need to know.

Remember, Google has 2,000 math PhDs on staff. They build self-driving cars and computerized glasses. They are really, really good. Do you think they’ll ding a domain because they found a page of unoriginal text?

A huge percentage of the internet is duplicate content. Google knows this. They’ve been separating originals from copies since 1997, long before the phrase “duplicate content” became a buzzword in 2005.

Disagree? Got Any Conflicting Evidence?

When I talk to SEOs about duplicate content, I often ask if they have first-hand experience. Eventually, I met someone who did. As an experiment, he built a site and republished posts from everywhere, verbatim, and gradually some of them began to rank. Then along came Panda and his rank dropped.

Was this a penalty? Or did the site just drop into oblivion where it belongs? There’s a difference between a penalty (like the blacklisting mentioned above) and a correction that restores the proper order of things.

If anyone out there has actual examples or real evidence of penalties related to duplicate content, I’d love to hear ’em.”

Me too! I’d really like to know whether setting off all the ‘duplicate content’ alarms at Google HQ by quoting other articles was worth it … or whether I’ll be writing my next article for an audience of no one, from my lowly position as a complete Google outcast.