SEO Myth: There is No Duplicate Content Penalty

This is probably old news to black hats, but I often hear people say “there’s no duplicate content penalty.” Newbies worry they’ll incur some kind of penalty for having identical copyright text across 100 pages or something, and other people like me jump in to alleviate their fears: “Google doesn’t penalize duplicate content; it just filters it out.”

However, a few months ago, back when I still believed supplemental results were largely due to duplicate content, I ran a test to try to figure out exactly what % similarity I had to hit for pages to squeeze into the main index. I created a directory with several pages: one original page, then several variations of it, ranging from 90% similar down to 20%. Initially, all the pages got into the main index. Then, after a few months, the entire directory poofed. If Google were merely filtering duplicate content, I’d expect the original page, at least, to remain indexed. I also expected a page that was only 20% similar to stay in the index. But no. Every page in that directory disappeared from both the main and supplemental index. At that point, I suspected that Googlebot was refusing to index any page in that directory.
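
(A quick aside on what I mean by “% similarity”: for the test it was a rough eyeball measure. If you want to actually put a number on it, something as crude as word-level Jaccard overlap will do; here’s a toy sketch in Python. This is my own made-up metric, not how I scored the test pages exactly, and certainly not how Google scores pages.)

```python
import re

def word_set(text):
    """Lowercase the text and reduce it to a set of unique words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def similarity(page_a, page_b):
    """Word-level Jaccard overlap between two pages, as a percentage."""
    a, b = word_set(page_a), word_set(page_b)
    if not (a | b):
        return 100.0
    return 100.0 * len(a & b) / len(a | b)

# Hypothetical example: an "original" page vs. a near-duplicate with one word swapped.
original = ("Acme Widgets sells cheap blue widgets online. "
            "Copyright 2007 Acme Widgets Inc. All rights reserved.")
near_dup = ("Acme Widgets sells cheap red widgets online. "
            "Copyright 2007 Acme Widgets Inc. All rights reserved.")
print(f"{similarity(original, near_dup):.0f}% similar")  # prints roughly "85% similar"
```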

Here’s the logic. Say I have a site with 245,230,991 pages and at least 60% of those pages are very similar. Does Googlebot really want to spend the time and effort to crawl all those pages? Keep in mind, since Big Daddy, Google’s been very picky about which pages to crawl and index. PageRank became an anti-spam weapon built to protect Googlebot from crawling thousands of low-value spam pages with nothing but guestbook links pointing at them. So if Googlebot thinks that a good number of pages within a directory are too similar, it would make sense not only to filter those pages out but to prevent future crawling of any pages in that directory.
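
Just to make that hand-waving concrete, the kind of rule I’m imagining might look something like the sketch below. To be clear, this is pure speculation on my part; the function name, the thresholds, and the idea of feeding it pre-computed similarity scores are all mine, not anything Google has described.

```python
def keep_crawling_directory(similarity_scores, sim_threshold=80.0, dup_ratio_limit=0.6):
    """Speculative directory-level rule. similarity_scores holds each non-original
    page's % similarity to the directory's original page (however you measure it).
    Once too many of them look like near-duplicates, give up on the whole directory."""
    if not similarity_scores:
        return True
    dups = sum(1 for score in similarity_scores if score >= sim_threshold)
    return dups / len(similarity_scores) < dup_ratio_limit

# Example: a directory where most variations are heavily similar to the original.
print(keep_crawling_directory([90, 85, 82, 80, 40, 20]))  # False: 4 of 6 pages are >= 80% similar
```

The exact numbers are made up, of course; the point is just that the decision happens at the directory level, not page by page, which would explain why even the 20%-similar page and the original vanished along with everything else.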

Caveman says something similar in this post, started way back on Sep. 29, 2005 (several months before Big Daddy):

The fact that even within a single site, when pages are deemed too similar, G is not throwing out the dups - they’re throwing out ALL the similar pages…if they find four pages on the same site about a certain kind of bee, and the four pages are similarly structured, and one is a main page for that bee, and the other three are subpages about the same bee, each reflecting a variation of that bee, the site owner now seems to run the risk that they will find all of the pages too similar, and filter them all, not just the three subpages.

Anyway, today I was re-reading a WebmasterWorld thread regarding Adam Lasnik’s Duplicate Content post, and happened upon a few interesting comments Adam wrote:

Some guy asked: Why not build into your webmaster toolkit something like a “Duplicate Content” threshold meter?

Adam responds:

The fact that duplicate content isn’t very cut and dry for us either (e.g., it’s not “if more than [x]% of words on page A match page B…”) makes this a complicated prospect.

Todd wrote about this 6 months ago:

If it was as easy as saying that any page with more than 42% duplicate content will be filtered from the search results, then all site owners and SEO’s would probably grab 40% duplicate content for every page filler. It IS NOT a percentage.

Maybe I should go over to Bill Slawski’s blog and search for some duplicate content patents.
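
In the meantime, here’s a toy illustration of why a flat percentage would be useless on its own. This is my own sketch in Python, not anything from Adam, Todd, or an actual patent; shingle-style phrase comparison is just one approach that shows up in the near-duplicate detection literature. The point is that “percent of matching words” and “percent of matching phrases” can be completely different numbers for the same pair of pages.

```python
def jaccard(a, b):
    """Jaccard overlap between two sets, as a percentage."""
    return 100.0 * len(a & b) / len(a | b) if (a | b) else 100.0

def shingles(text, k=4):
    """All overlapping k-word phrases ("shingles") in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

# Two pages built from the exact same vocabulary, phrased completely differently.
page_a = "the quick brown fox jumps over the lazy dog near the old barn"
page_b = "the lazy dog jumps near the barn over the old quick brown fox"

print(f"word overlap:   {jaccard(set(page_a.split()), set(page_b.split())):.0f}%")  # 100%
print(f"phrase overlap: {jaccard(shingles(page_a), shingles(page_b)):.0f}%")         # 0%
```

So a page can share most of its vocabulary with another and still not look like a near-duplicate once you compare phrases, which fits Adam’s point that there’s no magic [x]% cutoff.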

Regarding the duplicate content penalty (emphasis mine), Adam says:

As I noted in the original post, penalties in the context of duplicate content are rare. Ignoring duplicate content or just picking a canonical version is MUCH more typical…Again, this very, very rarely triggers a penalty. I can only recall seeing penalties when a site is perceived to be particularly “empty” + redundant; e.g., a reasonable person looking at it would cry “krikey! it’s all basically the same junk on every page!”

So if I take Adam’s word for it, Google does penalize sites for duplicate content, though it’s a once-on-a-DVD-night kinda thing (I cancelled my Netflix like a year ago).


10 Responses to “SEO Myth: There is No Duplicate Content Penalty”

  1. Hey Halfdeck

    Didn’t know you had a blog, but very happy to find it and now subscribed :mrgreen:

    Rgds

    Richard (Red Cardinal)

  2. Hey Richard,

    Thanks for dropping by =) I’ll see ya around in Google Groups.

  3. Hey,

    Just found your site via Google - so I guess you don’t have any penalties going on :P

    A very interesting read. I don’t know if you watch their videos, but whenever WebProNews has Google people on they usually say the penalty either doesn’t exist, or that it only exists for purely content-stealing sites. I’m actually thinking of writing a post up about it over at my blog, still have to do more research into it though :(

    Nice blog.

  4. Thanks Jeremy.

    According to my tracker script, you asked Google “is there a duplicate content penalty” :D

    Some people will tell you there’s no internal dupe penalty, that those pages just get filtered out of results. Vanessa Fox in particular said Google is pretty good at detecting dupe content and that webmasters should not worry about any penalties.

    But like Adam said, I believe the penalty exists, albeit only in extremely rare cases. He didn’t say “Hey, it’s all basically the same junk as the content on another domain!” Instead, he said “it’s all basically the same junk on every page!”

  5. Thanks J, I have done the same test with one of my sites and you are absolutely right. The so-called filtering is penalizing.

  6. My site is made up of mostly very similar pages, and Google hasn’t dropped the entire site. I still have many pages indexed which are very similar to others. Something is going on, but I don’t think it’s as black and white as that.

    For example, my site has 27 pages with villas for sale in Sotogrande, and yet I am still ranking #1 on Google for the keywords “chalet en venta en sotogrande”, and it’s listing the one result which it deems the most appropriate.

    This is being accomplished with a home page that has a PageRank of only 2, which is extremely low.

    Even stranger is the fact that I’ve only recently started to rank so high on Google for these keywords. Since my website is a real estate site and we have many properties listed with very similar or identical descriptions, some duplication is normal; however, it would still be bad to have our pages dropped from the index, because they provide a very valuable service to people looking for property in this area.

    What Google might be doing is also comparing similar pages on your site with similar pages on other sites. Suppose you have a page about bees which is similar to other pages on your site about other types of bees, but Wikipedia also has a similar page about those bees, as do other sites; then I suppose Google will only list the best site and ignore all the others related to the original topic, in this case, bees. In my case, though I may have 30 virtually identical pages about something, if there aren’t many pages on the internet covering it, then Google will still index and list those results. At least this is the explanation I have now for what I am observing.

  7. I don’t think Google will filter out pages just because they’re similar. You can have:

    1) Two domains dressing up identical content to pass them off as unique content to Google.

    2) A domain with similar but unique content pages.

    Google would benefit from focusing on case 1 and ignoring case 2.

    Your site’s TBPR of 2 is deceptive, because you have enough PageRank to get 500+ pages indexed; the home page TBPR just isn’t reflecting that, because from what I see (quick glance, mind you), your internal pages don’t link back up to the home page.

  8. Hi!

    I just found your blog through Google.

    There’s been a lot of confusion going around the internet today about the issue of duplicate content. I came across a WebmasterWorld thread where many testified that Google can really drop your pages out of the index. Some guess it’s some sort of algorithm issue, since even the original site was deindexed too. If that’s the case, it won’t be long before you find out your competitors have simply copied your content, looking for a way to bring you down.

  9. I think, like anything else, Google isn’t perfect at dealing with duplicate content. Their claim that they handle duplicate content reasonably well is sound, but in practice some situations fall through the cracks. I’ll have to dig up that WMW thread; the problem with them is they don’t post actual URLs, so we never really know the accuracy of the claims posted.

  10. Well, that is a perspective I have not heard before, that all pages will be thrown out. This gives me some new info to think about!
