Robots.txt Before Linking Up
The first thing you should do after you buy a domain is install an .htaccess file that deals with canonical issues like non-www vs. www and /index.html vs. /.
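As a rough sketch (assuming Apache with mod_rewrite enabled, and using example.com as a placeholder for your domain), those canonical fixes might look something like this in .htaccess:

    RewriteEngine On

    # 301 non-www requests over to the www hostname
    RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
    RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

    # 301 /index.html (and /dir/index.html) back to the bare directory URL;
    # checking THE_REQUEST keeps Apache's internal DirectoryIndex subrequest
    # from triggering a redirect loop
    RewriteCond %{THE_REQUEST} ^GET\ /(.*)index\.html\ HTTP
    RewriteRule ^(.*)index\.html$ http://www.example.com/$1 [R=301,L]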
The second thing you should do is install a robots.txt that prevents Google from crawling anything except the domain root.
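One way to express "root only" for Googlebot, leaning on its support for the non-standard Allow directive and the $ end-of-URL anchor (plain robots.txt has neither), is roughly:

    # Block crawling of everything except the domain root
    User-agent: *
    Disallow: /
    Allow: /$

Googlebot resolves conflicts by the most specific matching rule, so Allow: /$ wins over Disallow: / for the homepage and nothing else.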
The risk is that while your URLs are blocked, a hacker could submit them to Google's URL removal tool, which would suck, but in general this will prevent a lot of headaches down the road.
If your site has low PR, Google won't come around to deep crawl it every day. If you screw up and send hundreds of pages to supplemental hell, you'll have an even harder time getting them recrawled and back into the main index.
So assume you get one shot at a clean crawl. Block spiders with robots.txt; build the site, check for duplicate-content problems, and only when your site's ready, let Googlebot in.
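Letting Googlebot in at that point is just a matter of swapping in a permissive robots.txt; an empty Disallow allows everything:

    User-agent: *
    Disallow: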
—
Disallowing a page in robots.txt only seems to keep it out of Google's index if I put the robots.txt file up before Google ever spiders the page. If I add the robots.txt after the pages have already been crawled, the disallowed pages just get thrown into the supplementals and stay there till the end of time. If an entire site is blocked by robots.txt, it may be hidden from the SERPs, but I suspect Google still keeps records of the site in its database.
Evidence?
I have two directories on one domain that I use to clicktrack out to sponsors. First, I noticed these URLs were cluttering up my site: results in Google, so I put up a robots.txt to block them from being crawled. I also hoped that, with robots.txt in place, Google would eventually drop those URLs from its index.
I also included a new directory in the same robots.txt and switched many of my outgoing sponsor links to use this directory instead. After six months, none of the links under this directory show up when I run site:xyz.com/directory/.
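For what it's worth, the robots.txt in question looked roughly like this (/out/ and /go/ are stand-ins for the actual clicktracking directories):

    User-agent: *
    # original clicktracking directory, blocked after Google had already crawled it
    Disallow: /out/
    # new clicktracking directory, blocked before any links pointed at it
    Disallow: /go/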
Well, what about WMW (WebmasterWorld) blocking their entire site with robots.txt? That seemed to work. But did those pages really get deindexed, or did they just stop showing? My guess is they were just hidden from the SERPs.
Anyway, experience tells me robots.txt is not a tool for getting Google to drop pages from its index. Your pages may be dropped from the main index, but most likely they'll migrate into the supplementals and stay there indefinitely.
If you want to keep Google out of a directory, put up robots.txt before that directory goes live.
Matt Cutts on Robots.txt
Matt Cutts wrote an interesting post on 3/17/2006 about robots.txt and why pages sometimes show up in Google's SERPs as URL-only results even when robots.txt forbade Google from crawling them:
Obscure note #1: using the ‘googlebot=nocrawl’ technique would not be the preferred method in my mind. Why? Because it might still show ‘googlebot=nocrawl’ urls as uncrawled urls. You might wonder why Google will sometimes return an uncrawled url reference, even if Googlebot was forbidden from crawling that url by a robots.txt file. There’s a pretty good reason for that: back when I started at Google in 2000, several useful websites (eBay, the New York Times, the California DMV) had robots.txt files that forbade any page fetches whatsoever. Now I ask you, what are we supposed to return as a search result when someone does the query [california dmv]? We’d look pretty sad if we didn’t return www.dmv.ca.gov as the first result. But remember: we weren’t allowed to fetch pages from www.dmv.ca.gov at that point. The solution was to show the uncrawled link when we had a high level of confidence that it was the correct link. Sometimes we could even pull a description from the Open Directory Project, so that we could give a lot of info to users even without fetching the page. I’ve fielded questions about Nissan, Metallica, and the Library of Congress where someone believed that Google had crawled a page when in fact it hadn’t; a robots.txt forbade us from crawling, but Google was able to show enough information that someone assumed the page had been crawled. Happily, most major websites (including all the ones I’ve mentioned so far) let Google into more of their pages these days.
That’s why we might show uncrawled urls in response to a query, even if we can’t fetch a url because of robots.txt. So ‘googlebot=nocrawl’ pages might show up as uncrawled. The two preferred ways to have the pages not even show up in Google would be A) to use the “noindex” meta tag that I mentioned above, or B) to use the url removal tool that Google provides. I’ve seen too many people make a mistake with option B and shoot themselves in the foot, so I would recommend just going with the noindex meta tag if you don’t want a page indexed.
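For reference, the noindex meta tag Matt recommends goes in the page's head section, something like:

    <head>
      <title>Page you want kept out of Google</title>
      <!-- tell compliant bots not to index this page -->
      <meta name="robots" content="noindex">
    </head>

Keep in mind that Googlebot has to be able to crawl the page to see the tag, so don't also block that URL in robots.txt.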
What's Your Take?