Sitewide Duplicate Content Checker Tool Useless?

SEO Junkie just released a client-side application you can use to check for duplicate content on your site. I found a link to his app through Search Engine Watch yesterday while I was playing around with Google Co-op. The application is meant for small sites, and he warns that a lot of functionality is still missing. When he said “small sites” I hoped he meant fewer than 10,000 pages, but in the end I walked away scratching my head.

First, even when checking just two URLs, it’s incredibly slow compared to something like the Similar Pages Checker. Second, it indiscriminately crawls every URL it finds: affiliate redirect links, links to images, videos, etc. It took me a few minutes to figure out why I was suddenly getting hit with a barrage of pop-ups. Third, it crapped out in the middle of crawling my site with an error message: Runtime Error: Method ‘~’ of object ‘~’ failed. I went over to his blog and saw similar error messages being reported. It could be due to hitting a memory cap or something, but why not just gracefully stop crawling links once you hit a certain threshold?

My biggest complaint, though, is that you can’t use this to check large sites. Those are exactly the kind of sites I’d want to check for page similarity. I mean, why would I want to run my 30-page blog through something like this?

Suggestions:

  • Reference robots.txt and/or check the robots meta tag, and ignore disallowed or noindexed pages.
  • Do not follow redirect URLs that lead outside of a given domain.
  • Automatically limit the number of pages crawled to prevent the program from crashing.
  • Ignore image, audio, and video files (or does it already ignore them?)
  • Give users an option to compare URLs within a subdirectory instead of comparing each page to every other page in a domain. (A rough sketch of these crawl rules follows.)
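
For what it’s worth, here’s a rough Python sketch of what those crawl rules might look like. This isn’t anything from SEO Junkie’s actual tool; the seed URL, page cap, and file-extension list are just placeholders for illustration:

```python
# Hypothetical crawl constraints for a duplicate-content checker (illustration only).
import urllib.robotparser
from urllib.parse import urljoin, urlparse

SEED = "http://www.example.com/blog/"    # hypothetical starting URL / subdirectory scope
MAX_PAGES = 500                          # hard cap so the crawl stops instead of crashing
SKIP_EXTENSIONS = (".jpg", ".gif", ".png", ".mp3", ".wav", ".mov", ".wmv", ".swf")

def allowed(url, robots, scope):
    """Keep a URL only if it obeys the suggestions above."""
    parsed = urlparse(url)
    if parsed.netloc != urlparse(scope).netloc:        # don't follow links or redirects off the domain
        return False
    if not url.startswith(scope):                      # optionally restrict to one subdirectory
        return False
    if parsed.path.lower().endswith(SKIP_EXTENSIONS):  # skip image, audio, and video files
        return False
    return robots.can_fetch("*", url)                  # respect robots.txt disallows

def crawl(seed):
    robots = urllib.robotparser.RobotFileParser(urljoin(seed, "/robots.txt"))
    robots.read()
    queue, seen = [seed], set()
    while queue and len(seen) < MAX_PAGES:             # cap total pages crawled
        url = queue.pop(0)
        if url in seen or not allowed(url, robots, seed):
            continue
        seen.add(url)
        # ...fetch the page here, skip it if its robots meta tag says "noindex",
        # extract its links into `queue`, and compare body text for duplicates...
    return seen
```
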
