Supplemental Listings - How To Avoid Them
One of the biggest problems for new and veteran webmasters alike is avoiding Google's supplemental index. Pages tend to "go supplemental" especially when you launch an e-commerce site with thousands of similar, thin content pages, or if you leave your Blogspot blog's meta tags unattended (I'm guilty as charged). So, how do you avoid supplemental hell -- or if you're already there, how do you get out? There's no silver bullet, unfortunately, but here are some things you can do to improve your chances.
Update (5/11/2007): Since the release of Big Daddy, the concept of supplemental results has gotten a whole lot clearer. Sorry, I'm still too busy right now to fully update this page, but I strongly recommend you read the following blog posts first:
- Got Supplementals? Accepting PageRank is Only The Beginning. In this piece, I discuss the next-gen tactical approach to getting out of "Google Hell."
- Why Duplicate Content Causes Supplemental Results. Duplicate text on multiple pages does not cause supplemental results. But canonical issues (non-www/www) that split PageRank, which are one form of duplicate content, do contribute to supplemental issues. Read this post for details.
- Why Google Will Not Move Away From PageRank. 99.9% of SEOs dismiss PageRank as a non-issue. Find out why PageRank is the driving force behind supplemental results. Both Search Engine Journal and SEO By the Sea linked to this post. It's worth a read if you're still convinced META description tag tweaks will make supplemental results go away.
- PageRank Doesn't Matter Is Now Officially An SEO Myth. Similar bent to the post above. Don't obsess over PageRank, but do you know how it actually works? If you refer to Toolbar PageRank as PR or PageRank, read this post.
- Supplemental Index Fuzzier Than Ever. A somewhat outdated post I wrote before Matt Cutts explicitly stated PageRank is the main cause of supplemental results. The key thing to remember is that trust is what helps you get your pages back in, and that trust is a link analysis thing, not just a domain authority deal. If a major portion of your inbound links (IBLs) are artificial (exchanged, paid for, injected), you may hit a brick wall. Google is becoming more and more aggressive with link devaluation, so it's important to keep a well-rounded link profile that consists largely of natural, one-way inbound links. Exchanged and paid links still work, of course, but the more of them you have, the easier it is for Google to detect them. When Google doesn't trust your IBLs, adding more links may not do any good, because Google won't give them much credit. And adding the wrong kind of backlinks can make Google trust you even less.
If you have any questions, just send me an email (halfdeck at gmail.com).
UPDATE: July 12, 2007:
According to Jill Whalen, Dan Crow, director of crawl systems at Google, recently said that "basically the supplemental index is where they put pages that have low PageRank (the real kind) or ones that don’t change very often." NOTE: That's Jill's take on what Dan said, not what Dan said. A Googler would never refer to internal PageRank as the "real" PageRank. TBPR is just as "real" as internal PageRank. The difference is TBPR is on a delay and is much less granular than internal PageRank. Internal PageRank is updated daily and is calculated using floating point numbers.
UPDATE: Aug 1, 2007:
Google hides the supplemental results label to shift webmasters' focus to bigger and better things. site:domain.com/& (shows supplemental results, though unlabeled) and site:domain.com/* (shows pages in the main index) can still be used. In the announcement over on Google's blog, an engineer added that URL complexity is another factor in determining "supplemental status."
UPDATE: Aug 8, 2007:
Tedster on WMW posts a link to a patent called System and method for selectively searching partitions of a database, originally mentioned by Bill Slawski.
Supplemental Results Detector Tool
A site can never be 100% free of supplemental results. If you publish a lot of pages, some of them will be supplemental. But the Supplemental Results Detector tool can be helpful in getting important pages back into the main index.
Consulting
If you think your site is clean and you've tried absolutely everything, shoot me an email. I'm good at uncovering stuff even the top SEOs miss. I'm no marketing guru, so don't ask me for viral marketing or link building advice, but I can help you make the most of the visibility you already have and get rid of supplemental results. The solution isn't simply to "get more backlinks." SEOs will give you plenty of advice, but you need someone who will help you with the implementation.
Right now, my standard fee is $175 / hour (though it's negotiable), and I guarantee a refund of $115 / hour if my services don't help you achieve results. Why do I guarantee a refund? Because I'm sick and tired of unethical, greedy SEO firms charging $200/hour, not delivering results, and running away with the dough, hiding behind the "you can't guarantee search engine rankings" BS. When I get a contract, my goal is to deliver results. Plain and simple. I value my reputation and I don't care to be known as another slimy SEO snake-oil salesman.
Understanding Supplemental Results
If you get value-passing, trustworthy links from trusted sites that point directly at a supplemental URL, that URL will pop back into the main index. Getting too many of the wrong kinds of links (excessive reciprocal links, paid links, links injected into .edu domains) can make Google devalue your IBLs' PageRank and send your site into "Google Hell." During Big Daddy, Google began devaluing artificial links more aggressively, which explains why many site owners complained about supplemental results during Big Daddy's rollout in the spring of 2006. A recent Forbes article (published in May 2007) mentions a site that went supplemental after Google discovered excessive reciprocal links pointing at it.
Avoiding Supplemental Hell - The Short Answer
The number of supplemental results you have on your site depends on the quantity and quality of links from other domains linking to your site - otherwise known as PageRank - and how that juice is distributed throughout your site.
- Improve a page's PageRank by getting more quality, relevant inbound links.
- Improve PageRank distribution by optimizing internal links.
- Maintain a natural link profile to prevent inbound link devaluation.
Simply put: get more quality, relevant inbound links, and make sure the PageRank they bring flows to the pages that matter. The checklist below covers the details.
- Install a 301 redirect in .htaccess from non-www to www (or the other way around, your preference). You can also register your site with Google Webmaster Tools and set your preferred domain (though this is new, so do it at your own risk). For this domain, I used:
- RewriteEngine On
RewriteCond %{HTTP_HOST} !^seo4fun\.com$ [NC]
RewriteRule ^(.*)$ http://seo4fun.com/$1 [R=301,L]
- Installing 301 redirects will usually not make supplemental pages go away by itself, but it 1) tells Google that two URLs refer to the same page and 2) unifies PageRank onto one page, so there's no PageRank leakage. Higher PageRank also increases the chance of pages returning to the main index.
- Install a 301 redirect from /index.html to /. The simplest fix is not to use index.html/index.htm at all.
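- If you can't touch .htaccess (or prefer to handle it in code), here's a rough PHP sketch of the same idea. It assumes the page is actually served through PHP, and www.domain.com is a placeholder for your real host:
- <?php
// Hedged sketch: 301 any request for .../index.php or .../index.html
// back to the bare directory URL. Assumes PHP serves the page and that
// www.domain.com is replaced with your real host name.
$uri = $_SERVER['REQUEST_URI'];
if (preg_match('#^(.*/)index\.(php|html?)$#', $uri, $m)) {
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: http://www.domain.com' . $m[1]);
    exit;
}
?>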
- Don't link to http://www.domain.com/index.html. Instead, link only to http://www.domain.com/. Same deal with linking to index.html in directories - link to http://www.domain.com/dir/ instead of http://www.domain.com/dir/index.htm. If you use Dreamweaver, write links to home pages by hand. Why? Because Google indexes URLs, not pages. And though those URLs point to the same page, Google will treat them as unique addresses on the Interweb.
- Install META ROBOTS=NOINDEX on duplicate pages to prevent Google from indexing them (or use robots.txt).
- <META name="robots" content="noindex">
- Check the web for copies of your page. You can use Copyscape; I just take a unique snippet sample and run it through Google.
- Prevent multiple URLS from referring to the same page. For example,
- http://www.domain.com/index.php?user=halfdeck&page=supplemental
http://www.domain.com/index.php?page=supplemental&user=halfdeck
http://www.domain.com/index.php?page=supplemental&user=halfdeck&reply=23
- You'd want to noindex two out of those three URLs. If other people link to those pages and/or those URLs are already in Google's index, you need to 301 redirect them to the canonical version instead (one way to do that is sketched below).
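- One way to cut down on parameter-order duplicates is to pick a single canonical parameter order in your script and 301 everything else to it. A rough PHP sketch, using the parameter names from the example URLs above (www.domain.com and index.php are placeholders):
- <?php
// Hedged sketch: enforce one canonical query string order and 301 redirect
// any other ordering (or unknown parameters) to it.
$canonicalOrder = array('page', 'user', 'reply'); // the order you decide on
$params = array();
foreach ($canonicalOrder as $key) {
    if (isset($_GET[$key]) && $_GET[$key] !== '') {
        $params[$key] = $_GET[$key];
    }
}
$canonicalQuery = http_build_query($params);
if ($_SERVER['QUERY_STRING'] !== $canonicalQuery) {
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: http://www.domain.com/index.php' . ($canonicalQuery ? '?' . $canonicalQuery : ''));
    exit;
}
?>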
- 404 invalid URLs. For example, if your site is dynamic, make sure URLs like
- http://www.domain.com/bogusurl/thatdoesnt/really/exist/
- don't return a status 200. To check HTTP status codes, I use Web Sniffer. If necessary, use a PHP header to validate URLs. For example, if your URL is /blue/car/ and you only have entries for red and green cars in your database, issue a 404.
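- A minimal PHP sketch of that /blue/car/ case, assuming the request is routed through a PHP script (the color list stands in for a real database lookup):
- <?php
// Hedged sketch: send a real 404 instead of a "soft 404" that returns 200
// when the requested color doesn't exist in the database.
$validColors = array('red', 'green'); // stand-in for your database query
$color = isset($_GET['color']) ? $_GET['color'] : '';
if (!in_array($color, $validColors)) {
    header('HTTP/1.1 404 Not Found');
    echo 'Page not found.'; // or include your normal "not found" template
    exit;
}
// ...otherwise render the page as usual.
?>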
- Get rid of long session IDs:
- http://www.domain.com/session_id=d9j5034jkfgk94HHdfgasFG5sdf
- Use cookies instead.
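- In PHP, for example, you can force the session handler to rely on cookies only and stop it from appending session IDs to URLs. A minimal sketch using standard php.ini settings, set at runtime:
- <?php
// Keep session IDs out of URLs: cookies only, no transparent URL rewriting.
ini_set('session.use_only_cookies', '1');
ini_set('session.use_trans_sid', '0');
session_start();
?>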
- Too many parameters in your URL. Now, things like this may change in the future, but according to Google, having too many parameters in your url may prevent it from being listed in the main index:
- For example, the number of parameters in a URL might exclude a site from being crawled for inclusion in our main index; however, it could still be crawled and added to our supplemental index.
- Don't put the same content on multiple domains. Well, you can, but then expect to run into supplemental issues.
- Use unique TITLE/META description tags to create unique SERP listings. Each page should share as few words in the title/meta description as possible with another page. For example:
- Title: SEO4FUN Search Engine Optimization - Supplemental Hell
Description: SEO4Fun search engine optimization - A checklist of things to do to avoid getting trapped in supplemental hell.
Title: SEO4FUN Search Engine Optimization - Supplemental Listings
Description: SEO4Fun search engine optimization - A checklist of things to do to stay out of the supplemental index.
- are NOT unique enough. They share too many words: "SEO4FUN", "search engine optimization", etc. You want:
- Title: Google's Supplemental Index - Traffic Black Hole
Description: How do you get out of supplemental hell? Is it even possible? Who do I listen to? Nobody agrees on anything!!
Title: Squeezing Out More Traffic with Del.icio.us
Description: Not getting enough traffic off Google, Yahoo, or MSN? Try riding the social bookmarking wave.
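- If you have a lot of pages, a quick-and-dirty script can flag pairs of titles that share too many words, like the first pair above. A rough PHP sketch (the sample titles are placeholders, and the 50% threshold is arbitrary):
- <?php
// Hedged sketch: flag title pairs that share a large fraction of their words.
$titles = array(
    'SEO4FUN Search Engine Optimization - Supplemental Hell',
    'SEO4FUN Search Engine Optimization - Supplemental Listings',
    'Squeezing Out More Traffic with Del.icio.us',
);
foreach ($titles as $i => $a) {
    foreach ($titles as $j => $b) {
        if ($j <= $i) continue;
        $wordsA  = array_unique(str_word_count(strtolower($a), 1));
        $wordsB  = array_unique(str_word_count(strtolower($b), 1));
        $shared  = count(array_intersect($wordsA, $wordsB));
        $overlap = $shared / min(count($wordsA), count($wordsB));
        if ($overlap > 0.5) {
            echo "Too similar ($shared shared words): \"$a\" vs \"$b\"\n";
        }
    }
}
?>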
- Make sure your META description tags are no shorter than 60 characters. The point of using META description tags is to prevent Google from digging into your BODY HTML to fish out some irrelevant text from your navigation bar or listbox. However, if your META descriptions are too short (and 60 chars sounds like a lot, but it really isn't), Google will still dig into your source code.
- <meta name="description" content="This is an example of a short meta description tag.">
That description is only about 50 characters long. Make it longer.
- To prevent Google from making a mess, make sure your META descriptions are at least 60 chars. 60 is a ballpark figure, so in reality you may be safe just going with a minimum of 50. Also, since Google constantly improves its snippet generation process, keep your eyes open for updates on this. But as they say, err on the side of caution.
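- A trivial way to police this across a site is to loop over your descriptions and flag the short ones. A rough PHP sketch (the URLs and descriptions are placeholders):
- <?php
// Hedged sketch: warn about META descriptions shorter than ~60 characters.
$descriptions = array(
    '/supplemental.html' => 'This is an example of a short meta description tag.',
    '/'                  => 'Notes on Google\'s supplemental index and how to keep your pages out of it.',
);
foreach ($descriptions as $url => $desc) {
    if (strlen($desc) < 60) {
        echo $url . ': only ' . strlen($desc) . " chars - make it longer\n";
    }
}
?>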
- Move navigation below content in your source. Google parses HTML to construct the description snippets that show up in site: search results. The parsing algo is still somewhat simple-minded. To make sure Google finds your content, it's critical that you position content above navigational elements in your source code. Sometimes Google will fish out the right text even if it's buried deep in the source, but sometimes it won't. If the content sits right below the BODY tag, Google will always find the right text.
- Get rid of TABLES (optional...then again, not really). Use CSS. As I said, Google has problems with content buried inside structurally complex HTML. That means it can choke on content hidden away in a TR. Jill Whalen will disagree, but she's flat wrong when it comes to TABLES, since she doesn't take Google's description snippet generation into account (she also claimed PageRank doesn't matter, a statement she later retracted after Big Daddy). Remember, TABLES are for tabular data. You don't need TABLEs to make a page look pretty. For example, look at the structure of this page:
- <body>
<div id="wrapper">
Sometimes, supplemental index feels like Alcatraz, but I know you can get out!
</div>
</body>
- Where ID=wrapper is defined in my external CSS as:
- #wrapper {
text-align: left;
width: 700px;
}
- Store CSS/JS in external files. Don't clutter your HTML (or XHTML, whatever the case may be). Make life easier for Google and you improve your chance of a clean listing.
- Validate your page. "Look at Google, it doesn't validate!" is a line strictly for newbs. Pros don't even need a validator to write clean code, and they don't leave a mess like amateurs do. Remember, Googlebot is NOT GOD. It's not omniscient. It doesn't know how to parse HTML perfectly. Invalid code may rank, but will all invalid HTML get indexed? At least validate your pages to the point where there are no serious validation errors. Go to w3.org to validate your page. Right now. I mean it.
- Start off your main content with a H/P combination to make the most important part of your page easy for Google to find. This doesn't mean you should litter your page with H2 and H3 tags for purely presentational purposes or hopes of higher ranking. H tags should be used to tell visitors (and Google) how content is organized on a page.
- <body>
<div id="content">
<!-- Nobody uses TABLES anymore except Michael Martinez -->
<h1>Supplemental Hell is What This Page is About</h1>
<p>If your site is 99.999% supplemental, and your home page doesn't even show up when you run a site: search on Google... well, then this page is for you.</p>
</div>
<!-- Navigation BELOW CONTENT -->
<div id="navigation">
<a href="home-page.html">home</a><a href="about-me.html">About Me</a>.....
</div>
</body>
- Beef up your pages. More content not only means your pages actually have some useful information for visitors; it also means that, as you add more words, each page becomes less similar to other pages in the Webspace. Google will also be more reluctant to index pages without much content on them. Less content tends to mean less value (Seth's blog being one of the few exceptions to this rule).
How to Get Out of Supplemental Hell - One Way Ticket to Zero Traffic?
Once you're in there, it's not so easy to get out, or is it? Provided you took care of every item on the list above, you still probably won't see any positive changes for a while. That's because Google uses a special bot called supplemental Googlebot to refresh its supplemental cache. It's said to come around every 6 months, but for one of my sites, the wait was around 12 months. That doesn't necessarily mean pages won't return to the main index for a year. GoogleGuy/Matt Cutts recently hinted Google may be running supplemental cache refreshes more frequently - so we have reason for hope. But how can you speed things up? Improve your trust with Google. That means:
- Re-route more PageRank to pages you want in the main index by improving internal PageRank distribution. PageRank matters in indexing (read below about why).
- If a page links out excessively, add more internal links to that page.
- Start with pages already in the main index. You won't see immediate results by tweaking pages in the supplemental index. For example, if your site is made up of 4 pages (A,B,C,D) and only page A is in the main index, add a link on page A to page B to try to get page B into the main index. Also add a link from page B to page C and D in case Google refreshes the cache for page B.
- If you have a lot of outbound links on a high-TBPR page, try moving them off to a separate "links" page.
- Use nofollow on links to unimportant pages (e.g. "about", "contact", "privacy policy") you don't need in Google's main index. You want to "borrow" those pages' PageRank and re-route it to more important pages. HOWEVER, if a high percentage of the links on every page already point to external sites, you may want to avoid this: nofollowing internal links lowers the share of followed internal links on each page, so you end up giving away even more PageRank to external sites.
- More content, fewer pages. According to Matt Cutts' recent post, fast site growth can hurt your trust. How fast exactly? He's talking on the scale of hundreds of thousands of pages, not 5k or even 10k pages a day. So most webmasters probably don't need to sweat over this. But if things are really looking bad for your site, it won't hurt to slow down your pace a bit. Especially if you're running an e-commerce site, take care not to feed hundreds of thousands of pages to Google right off the bat. If you're launching a new blog, consider sticking to just one or two categories until you have enough posts under one category to justify splitting it up into subcategories. Publish fewer, but meatier pages, instead of tons of pages with paper-thin content.
- We saw so many urls suddenly showing up on spaces.live.com that it triggered a flag in our system which requires more trust in individual urls in order for them to rank (this is despite the crawl guys trying to increase our hostload thresholds and taking similar measures to make the migration go smoothly for Spaces). We cleared that flag, and things look much better now.
Matt Cutts - on recent MSN Spaces large scale migration.
- More quality, relevant inbound links. People say this all the time. What I would focus on is not just the links, but building the best page on the web for my topic, then getting people to notice. But because visitors aren't necessarily all web-savvy bloggers or college professors, you may want to tailor some pages to target a link-happy audience and package your ideas in a way that encourages people to link to you. That still means once you get the attention, you need to have the content there for your primary target audience, which isn't bloggers, but people who are looking for useful information, products to buy, etc. So summing this up, I would focus on churning out TWO types of pages (or a combination of both):
- Write the best page on the web offering either the best solution, the cheapest product, or the most comprehensive information available out there. With this page, I tried to do exactly that. Your target audience here is your buying customers or whoever you have in mind that will find your site useful, NOT webmasters (well, unless you're like me writing about SE stuff.).
- Build link bait targeting bloggers, social bookmarkers, university professors, Democratic senators, big media, etc. These people will link to your link bait page, improving your site's visibility and trust with Google. Link bait, in case you've never heard of it, is a website/page built to grab your attention. Offering a cool free tool, flaming Matt Cutts, saying "Page Rank Doesn't Matter", publishing a Top 100 list (sorta like this page) are all examples of link bait. The focus here, though, is on targeting a link-happy audience and on marketing to them instead of on delivering to your primary target audience. It's all about spreading your word through other people's websites.
- Higher PageRank. Does that mean cheap reciprocal link trades, link buying, directory listings, etc.? Well, I believe they will work to an extent... but remember, if the links are low quality, they won't necessarily improve your trust score, even if you see an improvement in your TBPR. Trust is what you're after when combatting supplemental issues. PageRank is a big factor in trust, but there are other things that can kill your trust score, like linking out to spammy sites. Google is already nullifying PageRank transfers from some bought links. Here's a quote by Matt Cutts on Threadwatch regarding PageRank and going supplemental:
- I would recommend that you think of supplemental results as pages which (most likely) have less PageRank than pages in the main web index. So www.google.com/reviews?cid=b3c12ee96ed87b2d , which is a review of "The Gold Rush," a movie from 1925, is a perfectly natural url to be a supplemental result. Although it sounds like it was a good movie. ;)
- Adam Lasnik agrees:
- "Cure? Get more quality backlinks. This is a key way that our algorithms will view your pages as more valuable to retain in our main index."
- Improve the quality of your site. Add more meaty, unique pages that visitors will appreciate. Redo the design. Build the kind of pages I can't resist telling a friend about. Basically a regurgitation of point 1, but with added focus on design, functionality, originality, value, and other WOW features, like AJAX, Flash, downloadable videos, demos, tools, etc.
- Improve your overall ranking. According to a patent filed by Anna Patterson in 2005 (thanks to Bill for pointing it out to me in this cre8 thread), the supplemental index can be the stopping point for pages that are outranked by too many other pages. Taken together with Matt Cutts' PageRank statement, what does it mean? We might think about three things: 1) more inbound links, 2) higher PageRank, and 3) a tight internal linking structure.
- The posting list entries for up to the first K documents remain stored on the primary server 150, while the posting list entries for the remaining n>K documents are stored in the secondary index 152
- So, tighten up your internal linking structure. How? I'm working on a script (well, it's already working for me, but it probably needs some debugging before it works for you) that helps me do this. I may post the script up on my blog sometime in the future. As for the importance of internal PageRank distribution, some people may tell you to just work on getting more inbound links. Unfortunately, one look at Amazon.com or Google.com will tell you a TBPR 10 site can have supplemental pages just because the site's PageRank is poorly distributed. If you already have a TBPR 7+ site, just adjusting how pages are linked together should solve your supplemental problem.
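To give you an idea of what such a script does, here's a stripped-down PHP sketch of the standard PageRank iteration run over a tiny, hand-built internal link map. It's nothing like a full crawler (and it's not the script I use); the page names are placeholders:
<?php
// Hedged sketch: plain PageRank iteration over a hand-built internal link map.
// Keys are pages; values are the pages they link to.
$links = array(
    'home'     => array('products', 'about', 'contact'),
    'products' => array('home', 'widget-a', 'widget-b'),
    'widget-a' => array('home', 'products'),
    'widget-b' => array('home', 'products'),
    'about'    => array('home'),
    'contact'  => array('home'),
);
$damping = 0.85;
$pages   = array_keys($links);
$n       = count($pages);
$pr      = array_fill_keys($pages, 1.0 / $n);
for ($i = 0; $i < 50; $i++) {
    $next = array_fill_keys($pages, (1 - $damping) / $n);
    foreach ($links as $page => $outlinks) {
        $share = $pr[$page] / max(count($outlinks), 1);
        foreach ($outlinks as $target) {
            $next[$target] += $damping * $share;
        }
    }
    $pr = $next;
}
arsort($pr);
foreach ($pr as $page => $score) {
    printf("%-10s %.4f\n", $page, $score);
}
?>
Run something like this over your real internal link map and you can spot which pages are being starved of PageRank before Google tells you by dropping them into the supplemental index.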
Supplemental Index Key Ideas
- Two URLs referring to the same page split PageRank. You lose PageRank for the page that gets filtered out, and if some people are linking to the supplemental page and you don't have a 301 installed (or Google hasn't followed it yet), the page that managed to stay in the main index is weaker due to the PageRank split.
- Google will list URLs blocked by robots.txt. URLs blocked by robots.txt will still show up in Google's results if there are links pointing to them. To keep them out of the results entirely, tag those pages with a META ROBOTS=NOINDEX tag (and don't block them in robots.txt, or Googlebot will never get to see the tag).
- Supplemental Googlebot does not come around regularly (as of yet). For one of my sites, Supplemental Googlebot took exactly one year to recrawl pages that turned supplemental. That's one hell of a long wait for a newbie mistake.
- Many pages listed in the main index seem to have an older copy in the supplemental index. If you see a page in the supplemental index, it doesn't necessarily mean there's anything wrong with the page. Sometimes a supplemental listing just means Google dropped that page from its main listing.
- If you have tons of duplicate content under a directory, Google may slash the entire directory, not just the dupes. Here I'm speaking strictly from personal experience. If you look for the pages I ran my supplemental tests with, they are the only set of pages that don't appear anywhere in Google's index off this domain.
- Supplemental pages have PageRank, according to Google Webmaster Help Center: What's a Supplemental Result?, which says "Please also be assured that the index in which a site is included doesn't affect its PageRank." This is important when optimizing your internal PageRank distribution, because it makes a difference if a link to a supplemental page is considered a dangling link or not. A dangling link is typically a link to a page that's not in Google's index. So in that sense, a page showing up as a supplemental result is a notch better than a page not showing up at all, because that supplemental page still contributes to the PageRank of other pages on your site it links to.
Types of Supplemental Results
These are some of the types of supplemental results you may encounter when running a site:domain.com search:
- Previously indexed pages are dropped, leaving only the supplemental copy behind. In this case, your page didn't "go supplemental" due to any obvious duplicate content reasons. Instead, your problem could be due to lack of PageRank or a bad outbound link to a crap site that reduces your page's trust score. A few high-quality inbound links may set things straight.
- A page triggers a duplicate content filter and is flagged as supplemental. These pages will not return to the main index (or be dropped completely) until Supplementalbot recrawls them. The key here is not to let Google mistake good pages for duplicates.
- Pages that are correctly indexed show up as supplemental depending on the search. Google keeps several copies of the same page, with the older copy usually stored in the supplemental index. When a page's content changes, some queries can return a page that's listed in the main index as a supplemental result.
- Recently crawled supplemental pages with a fresh cache yet to be evaluated. The two processes (refreshing the supplemental index, and evaluating pages for re-inclusion into the main index) happen separately. Therefore, if you see some pages with recent cache dates listed as supplemental, it doesn't necessarily mean they will stay supplemental for long.
Hacks
To see an estimate of the number of supplemental pages listed under a domain, try: site:www.mydomain.com ***. I recommend you not take the results too seriously, though.
You Got a Link Out of Supplemental Results?
Don't get too excited just yet. Consider the page still on the crawl fringe. If you see a page pop into the main index, act fast because if the page falls back into the supplemental index, the page's cache will freeze for months.
Take another look at your site's internal PageRank distribution. Focus on the PageRank distributed to your page and make sure most of that PageRank stays within the site. (You can do this just by adding more internal links on a page). Note: You can't always tell how much PageRank is going into a page by looking at your toolbar. We're talking about the difference between PageRank 0.15 and PageRank 0.9812 here, which all looks to you like PR0 on the toolbar. You need to use a script to figure out what's going on.
Point more links to those pages so they stay on the right side of the fence. With a little shift in PageRank, the pages can drop right back into the supplemental index. Keep in mind, PageRank isn't a straight algorithmic calculation anymore. You want PageRank from links Google won't discount. Preferably, you want organic, not-paid-for citations from reputable sites.
My Older Supplemental Listing Notes
You might find some conflicting information, unpolished ideas, or something I've already said above repeated here in my notes (which I wrote around Feb 2006 - yeah, a long time ago). I'll hopefully clean it up later on (though I doubt it.. I'm up to here with work - sucks to be my own boss sometimes). For even more info on general SEO stuff, visit my blog. You can also email specific questions to halfdeck AT gmail.com.
Oh yeah.. if you want your site checked over, there are tons of willing webmasters over at Google Group Webmaster Help, including me (30 min ~ an hour a day, as of 9/10/2006 - who knows how long that'll last? So far I'm enjoying it though). You'll also occasionally come across a few Googlers, including Adam Lasnik and Vanessa "Buffy" Fox. Unlike some other SEO related forums, you can post specifics, so there's a better chance people will help you iron out obvious problems with your site, if any. Beware - there are some noobies and trolls in there posting inaccurate/misleading information too (no surprise, right?), so I recommend you consider all angles before going with just one guy's advice. Even a really knowledgeable SEO can be wrong sometimes, so the best policy is to play the odds (e.g. avoiding hidden text because it may get you banned) and optimize via process of elimination (e.g. getting rid of "possible" problems).
Note: according to Googleguy, "the supplemental results are a new experimental feature to augment the results for obscure queries."
How do you avoid getting pages listed as supplemental? Here are a few of my guesses that'll be tested. Note: I'm not talking about canonical problems here, like www/non-www or / versus /index.html or sites with query strings that generate duplicate pages. I'm talking about regular pages going supplemental because somehow Google/Yahoo thinks they're similar to other pages on the web.
My guess is there are several factors that come into play when deciding whether a page belongs in the main or supplemental index. But the goal of these tests is to determine how much we can get away with before a page is flagged as supplemental, so that there's no guesswork involved when publishing pages and wondering if they'll end up in the supplemental index.
By the way, till Google temporarily dropped this page from their index a few nights ago, I was listed under WMW, digitalpoint, and Jim Boyakin on Google for "supplemental hell." Don't ask me how that happened; probably the "fresh factor" kicking in, because I'm not optimizing this page for anything at all.
Also, I just noticed the domain's /index.html got crawled. Damn Dreamweaver. This teaches me never to link to pages using the link tool. Also, I added the following to my .htaccess (3/22/2006). Not the cleanest mod_rewrite, but it gets the job done. A good reminder to have a fail-safe .htaccess installed before a domain is ever crawled.
RewriteCond %{REQUEST_URI} ^/index\.html$ [NC]
RewriteRule ^index\.html$ http://seo4fun.com/ [R=301,L]
Also, this SERP shows the / and index.html as similar. They're identical pages with cache dates 3 days apart (page text was not modified during that time). This might mean one of those pages is on its way to the supplemental index, or that Google loosened up its supplemental filter(?). Time will tell.
Last thing: if you're using WordPress, its default .htaccess will rewrite /blog/ to /blog/index.php. I'll see how that pans out on Google.
Possible Reasons for Winding Up in Supplemental Hell
- Canonical Problems (I'll deal with this elsewhere since most forums cover it pretty well, and they don't necessarily have a negative effect on the domain as long as the key pages are indexed correctly).
- www / non-www
- http://www.xyz.com/ vs http://www.xyz.com/index.html
- /dynamic.cgi?id=x&sessionid=y&options=z generating similar or identical content.
- "Sloppy webmastering" and misconfiguration of dynamic sites can easily generate multiple urls that generate the same page.
- Duplicate content (text taken from some other page on the web).
- No content (i.e. thin pages). The usual advice is to aim for 200 ~ 250 words per page, but from what I've seen, this is not true. Uniqueness of the content seems to matter more than file size/word count.
- Orphaned pages.
- HTML head element not closed; or, body not opened. Again, not true from looking at my test pages. I left broken <head><body> tags but the pages were indexed correctly.
- Similar header/footer/side nav. This may be a factor especially if the navigation links, footers, and header text comprise a big percentage of the page, making dynamically generated pages very similar to each other.
- Content is buried in the bottom half of HTML code.
- Large percentage of reciprocal links. Hearsay.
- Identical title/description. This is the easiest way to create supplemental pages.
- Lack of description meta. This comes into the picture if the on-page code is similar/identical to other pages.
- Similar descriptions across a site.
- Lack of incoming links. I doubt this. Why? This site has pages with only one incoming link but the pages are indexed correctly. On the other hand, if you are trying to get a page listed as supplemental back into the main index, adding more incoming links will probably help increase crawl frequency.
Test Results
- Google indexed a page 90% similar to the original as unique. The original page also remained in the main index. The title and the opening text were unique; that's it. Below that, I added a huge block of duplicate text, but it didn't seem to bother Google.
- Files missing </head> or <body> tags were still indexed as unique pages, so I'm not convinced that has anything to do with creating supplemental pages.
- Unique title and description or opening text on the page (H1/P) almost guarantees a non-supplemental listing.
- Page size doesn't seem to matter. Even a page with only 10 words remained in the main index.
How to Get Rid of Supplementals?
- The cold hard truth is that once a page gets flagged as supplemental, Google will remember that page forever (unless they deal with it in the near future, like Matt Cutts hinted on his blog). Even if you fix duplicate content issues, you need Supplemental Googlebot (or whatever it's called) to come recrawl your page. And if you don't have enough incoming links, or if the other pages on your site linking to that page are also supplemental, then you're in a sand trap. Not enough PR/backlinks = no recrawls. The usual advice is to feed the supplemental page more direct links, add it to your sitemap, cross your fingers, and pray. Your mileage will vary.
- If a page that used to rank and pull traffic is showing as supplemental with old cache, add more links to that page and pull up the PR. More incoming links mean more frequent crawls, and higher PR means quicker indexing. Once a new snippet is generated for that page, it will return to the main index. If there are other duplicates, they will fall back into supplementals and your page should go back into the main index.
- Deal with canonical issues. Get rid of query strings if possible. 301 unwanted URLs.
- Get rid of repetitive words in your title/description; make them all as unique as possible.
- Use CSS to place content above nav links in the HTML code.
- Don't create new urls. This will only worsen your problem.
- There are other issues to worry about too, but I'll write them down later. More incoming links, unique title/description usually does the trick.
- If you are in the supplemental club and your site is clean, just wait :)
Supplemental Database and Main Index Are Two Separate Databases
From following WMW's Supplemental Club thread, I'm convinced that 1) the main index and the supplemental index are two separate databases, and 2) the supplemental index is structured to be add-only. Once a docID gets flagged as supplemental, it will stay there forever. If a page goes back into the main index, the supplemental listing in Google will disappear, but the page's docID and HTML text still remain in the supplemental database, undeleted. Aside: if you check the cached copy of your supplemental page in all datacenters, you should see that all of the copies have an identical timestamp.
Supplemental test pages underscored that duplicate content pages don't necessarily go supplemental. Now that I've cut off juice to those test pages, they should go supplemental.
Want more stuff to read? Check out other people's thoughts on supplemental results.
Back to SEO4FUN Blog