Google Cache
I just checked my site at 216.239.59.147 and noticed a huge drop in pages indexed. Either I’m still doing something wrong or Google is hiccupping again.
I need to check this DC and see how many of my pages, including the subdomains, are indexed correctly.
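For my own reference, here’s roughly how I’d script that check: hit the DC’s IP directly with a site: query and scrape the estimated count. This is only a sketch; the “of about N” parsing and the headers are guesses that depend on whatever HTML the DC happens to serve that day, and example.com just stands in for my domain.

```python
import re
import urllib.parse
import urllib.request

def indexed_count(dc_ip, query, filter_off=False):
    """Scrape the estimated result count for a query from a specific DC.

    Assumes the DC answers plain /search requests at its IP and that the
    results page still says something like "of about <b>10,900</b>" --
    both can change at any time, so treat this as a rough probe only.
    """
    url = "http://%s/search?q=%s&num=1" % (dc_ip, urllib.parse.quote(query))
    if filter_off:
        url += "&filter=0"          # turn off the duplicate/similar-page filter
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    html = urllib.request.urlopen(req).read().decode("latin-1", "replace")
    m = re.search(r"of\s+about\s+<b>([\d,]+)</b>", html)
    return int(m.group(1).replace(",", "")) if m else None

# compare two DCs for the same site: query
for ip in ("216.239.59.147", "64.233.179.104"):
    print(ip, indexed_count(ip, "site:example.com"))
```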
Since Google keeps falling back to cached copies from August these days, I’m wondering whether:
A. Google keeps several versions of a document in its cache per DC, or
B. Google keeps one version of a document per DC.
Also, considering BD seems to shift in and out of a DC, I’m not sure what the hell I’m seeing.
Maybe my site is lacking PR to get crawled often, but considering the major supplemental “bug” hitting a lot of well-established sites out there, lack of PR may only be a small part of the problem.
Question: Are the pages listed as supplementals supposed to be supplemental? (Definitely, some pages are supposed to be supplemental; usually every domain naturally has a few supps.) Or were the pages close-call dupes that got tipped over into supplementals due to a bug in the dup-filter algo?
I’ll have to take a look at a few of my competitor sites to see how their pages are holding up in Google.
I just added &filter=0 to my site: query and the page count jumped from 700 to 10,900!? Now what in the hell is going on?
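Using the hypothetical indexed_count() helper sketched earlier, that comparison is just:

```python
dc = "216.239.59.147"
print(indexed_count(dc, "site:example.com"))                   # ~700
print(indexed_count(dc, "site:example.com", filter_off=True))  # ~10,900 ?!
```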
Another thing: how does Google crawl/cache sites again? It has a few Mozilla/regular crawlers (around 300?), and they cache the information where? In a single repository, or do they dump their info onto separate DCs (which would make no sense)? They obviously must all share the same docID and URL hash… but what about the HTML cache? Is there some status field which prevents a page from being displayed depending on the DC?
Why would a site show on one DC and not display at all on another? I’m missing some basic SEO 101 info here…
Here’s an interesting speculation by lammert on supplementals at wmw (msg #61):
The search engine index is primarily designed to store pages, pages and even more pages. The rate at which new pages occur is higher than the rate at which old pages disappear. This was at least the situation a few years ago. As a programmer I wouldn’t be surprised if Google designed the database to be add-only, and solved the delete problem just as DBase did, by marking unwanted records with a flag.
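Just to make that dBase analogy concrete, a toy add-only store might look like this (purely illustrative, obviously nothing to do with Google’s real internals):

```python
class AppendOnlyIndex:
    """Toy add-only store in the dBase spirit: records only ever get
    appended, and a 'delete' just flips a flag -- nothing is removed."""

    def __init__(self):
        self.records = []      # grows forever
        self.latest = {}       # doc_id -> position of the newest record

    def add(self, doc_id, url_hash, content):
        self.records.append({"doc_id": doc_id, "url_hash": url_hash,
                             "content": content, "deleted": False})
        self.latest[doc_id] = len(self.records) - 1

    def delete(self, doc_id):
        pos = self.latest.get(doc_id)
        if pos is not None:
            self.records[pos]["deleted"] = True   # the bytes stay on disk

    def lookup(self, doc_id):
        pos = self.latest.get(doc_id)
        if pos is None or self.records[pos]["deleted"]:
            return None
        return self.records[pos]
```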
According to g1smd at wmw, Google has a database for supplementals apart from the live cache, and once a page gets into that supplemental cache it stays there permanently, even when the hosting is taken down and the domain no longer exists.
Think of this in terms of MySQL tables:
google_supplemental: DocID / URL hash / cache date / cache content
google_live: DocID / URL hash / cache date / cache content
Whatever record gets added to google_supplemental is never deleted.
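If that two-table picture were literally true, serving a cached copy could be as dumb as “try live, fall back to supplemental,” which would also explain the stale August cache popping up whenever the live copy goes missing. A toy sketch, with made-up rows and the field names from above:

```python
from datetime import date

# made-up rows keyed by URL hash: (cache date, cache content)
google_live = {
    "hash_homepage": (date(2005, 12, 1), "<html>fresh copy</html>"),
}
google_supplemental = {
    "hash_homepage": (date(2005, 8, 14), "<html>stale August copy</html>"),
    "hash_dead_page": (date(2005, 8, 2), "<html>host long gone</html>"),
}

def serve_cache(url_hash):
    """Prefer the live copy; fall back to the never-deleted supplemental one."""
    if url_hash in google_live:
        return "live", google_live[url_hash]
    if url_hash in google_supplemental:
        return "supplemental", google_supplemental[url_hash]
    return "not indexed", None

print(serve_cache("hash_homepage"))   # fresh live copy
print(serve_cache("hash_dead_page"))  # August supplemental, hosting or not
```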
Interestingly, steveb added: “there are two sets of supplementals on different datacenter groups, so depending on what datacenter you hit you could see a different batch.”
I can’t confirm this, but it’s worth noting.
Another quote, this one from Dayo_UK:
“These pages have been crawled and cached in the supplemental index but not been crawled or cached in the normal index… The question that people should be asking themselves is why Google are now not listing their pages in the normal crawl, as these have disappeared, rather than the pages going supplemental (as a supplemental copy was probably already there).”
So… the pages didn’t go supplemental; the pages in the main index just aren’t being displayed anymore(?)
I think Dayo_UK is on to something here. The pages that were crawled correctly (e.g. the homepage, which never had a supplemental problem) are showing perfectly in the BD DCs.
I ran another subdomain page count: a few minutes ago I saw 584 indexed on 64.233.179.104, and now I see 684 on the same DC. Even with filter=0, the resulting number is the same. I guess it could be a timing thing… but where is that 10,900 number coming from? Is that just an approximation? I ran page counts on every directory and they don’t add up anywhere close to that huge number.
I know Google sometimes hides pages, but usually if I do a site: targeting a directory, it will show the pages that are being suppressed. I just can’t figure it out.
Even with a 684 page count, Google is only displaying 276, which is pretty much identical to the number of pages Yahoo has indexed. Am I missing something? And why are the rest of the pages not being displayed? Are they supplemental? filter=0 won’t display them.
Is this some kind of subdomain penalty??
site:janesguide.com -inurl:www returns 607 pages today on the same DC, with around 86 URLs shown as unique. Let’s see why they could be supplemental.
- Some URLs with 5 query-string parameters are supplemental (they should be).
- Some URLs are not supplemental but are hidden as “similar pages” due to an identical META description: “Since 1997, JanesGuide and Jane Duvall have been your guide …”
- All subdomain root URLs are cached correctly, and 607 out of 607 show up in the SERPs.
- Interesting to note that the pages are all light, around 7k.
- The nav menus are below the content, and there are 240 unique words on the page (a quick check for both is sketched after this list).
Past around 350 URLs, the rest go supplemental.
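For the record, the 7k / 240-unique-words figures came from a rough check along these lines (crude regex stripping, good enough for a ballpark and nothing more):

```python
import re
import urllib.request

def page_stats(url):
    """Rough page weight (KB) and unique visible-word count for a URL."""
    html = urllib.request.urlopen(url).read()
    size_kb = len(html) / 1024
    text = html.decode("utf-8", "replace")
    text = re.sub(r"(?is)<(script|style).*?</\1>", " ", text)   # drop scripts/css
    text = re.sub(r"(?s)<[^>]+>", " ", text)                    # drop tags
    words = {w.lower() for w in re.findall(r"[A-Za-z']+", text)}
    return round(size_kb, 1), len(words)

print(page_stats("http://www.example.com/"))   # placeholder URL
```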
I refresh a few minutes later and now the page count is 718, with roughly a 10k count under filter=0 and 306 documents actually shown.
I only use a few &query-type URLs, and I have most of them blocked by NOINDEX. It could be that Google indexed a lot of them and is completely suppressing them.
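Worth double-checking that those &query-type URLs really do serve the NOINDEX tag. A quick sketch (the URLs are placeholders for my own pages):

```python
import re
import urllib.request

def has_noindex(url):
    """True if the page serves a robots meta tag containing 'noindex'."""
    html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
    metas = re.findall(r'(?is)<meta[^>]+name=["\']robots["\'][^>]*>', html)
    return any("noindex" in m.lower() for m in metas)

for u in ("http://www.example.com/list?sort=date&page=2",
          "http://www.example.com/list?sort=title&page=3"):
    print(u, "NOINDEX" if has_noindex(u) else "indexable")
```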
I’ll have to email myself whenever a 404 page is generated, just to see if bots are crawling non-existent pages.
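Something like this, dropped into whatever script renders my custom 404 page, would do it. It assumes CGI-style environment variables and a local mail server, and the addresses are placeholders:

```python
import os
import smtplib
from email.message import EmailMessage

def report_404():
    """Mail myself the details of a 404 hit; meant to be called from the
    script that renders the custom 404 page (CGI-style env vars assumed)."""
    msg = EmailMessage()
    msg["Subject"] = "404 hit: " + os.environ.get("REQUEST_URI", "?")
    msg["From"] = "404-watch@example.com"      # placeholder addresses
    msg["To"] = "me@example.com"
    msg.set_content(
        "URL: %s\nUser-Agent: %s\nReferer: %s\n" % (
            os.environ.get("REQUEST_URI", ""),
            os.environ.get("HTTP_USER_AGENT", ""),   # is it Googlebot?
            os.environ.get("HTTP_REFERER", ""),
        )
    )
    with smtplib.SMTP("localhost") as s:       # assumes a local MTA
        s.send_message(msg)
```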
What really bugs me is that out of, say, 724 pages, only around 250 are actually displayed in the SERPs. Some of those hidden pages are supplementals. But what about the rest of the pages?
What's Your Take?