Word Count Tool
I wrote this little word count tool to see how much content there actually is on my page for Google to index. This tool ignores anchor text even if I linkdrop in a meaty paragraph like this one, just for simplicity's sake. It's nothing unusual or spectacular. I just like writing my own scripts; once I add a few more features to it though, it will be something else. Anyway, I wrote this script to see if I can prevent myself from feeding Google any more thin pages and creating supplemental pages unnecessarily. It's *supposed to be* an anti-supplemental index tool. I noticed pages with very little text often get thrown in the supplemental index. I'm not saying having no words on a page automatically results in a do-not-pass-go trip to supplemental hell, but that's been a noticeable pattern with my domains. I may expand this into a Title/description SERP snippetizer type of deal, just to make sure I get the text I want to show on Google SERPs. If you like this tool or have any suggestions, shoot me an email (look for it on my blog). And why did I leave this huge chunk of text above the submit form so you have to scroll down every time you want to run the script? Google, dammit.
How to use this Tool
Dump your URL in the textbox and click submit. Click re-evaluate if you want to run the script on the same page again. If the word count comes back under 100, you're in deep trouble, buddy.
URL:
META Description:
Description Length: 0
TITLE:
NOTE: Anchor text is ignored.
Doesn't ignore stop words or short words.
Text length in characters: 0
IGNORED WORDS: (none)
Wordcount: 0
Link Count: 0
Unique Links: 0
External Links: 0
Notes
- This script pulls text off the page.
- It ignores A HREF text unless it's inside a paragraph.
- It ignores META Description.
- A paragraph must be at least X words long to be counted.
- I purposely omitted the META description tag on this page to see what text Google will snippetize. It's a no-brainer with a simple page like this, but hey, I just want to be sure.
- Yeah, I ran this page through my Word Count Tool to make it spider friendly. I got it up to over 400 words, so if Google refuses to index it, then I'll blame it on the lack of links to this page, not on skimpy page size. Really, if Google dumps this page in the supplemental index, Google is really broken.
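The rules in these notes can be sketched in a few lines. This is a rough Python illustration, not the PHP behind this tool. It assumes one reading of the notes: only text inside P tags is counted (so link text outside a paragraph never makes it in), and the function name `count_words` and the threshold `MIN_PARAGRAPH_WORDS = 5` are made-up stand-ins, since the real value of X isn't given.

```python
import html
import re

# Hypothetical stand-in for the unspecified "X" in the notes.
MIN_PARAGRAPH_WORDS = 5


def count_words(page_html, min_paragraph_words=MIN_PARAGRAPH_WORDS):
    """Count words inside <p> elements that are long enough to matter."""
    # Drop script and style blocks so their contents are never counted.
    body = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", page_html)
    total = 0
    # Only paragraph text is considered, so stray link text outside a
    # paragraph (navigation, footers) is ignored.
    for paragraph in re.findall(r"(?is)<p\b[^>]*>(.*?)</p>", body):
        # Strip the remaining tags but keep their text, so link text
        # inside a paragraph still counts.
        text = re.sub(r"(?s)<[^>]+>", " ", paragraph)
        words = html.unescape(text).split()
        if len(words) >= min_paragraph_words:
            total += len(words)
    return total
```

Like any regexp parser, this falls over on nested or unclosed P tags, which fits the ill-formed HTML bug listed under Known Bugs.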
Fixes & Features
- Ignores numbers, prices, etc.
- Recognizes META description. This tag should be at least 50 chars long.
- Ignores words 3 chars or less.
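A rough Python sketch of these three rules, for illustration only (the tool itself is PHP and its code isn't shown here). The function names are made up, the 50-character minimum and the 3-character cutoff come from the list above, and treating any token that contains a digit as a "number or price" is an assumption.

```python
import re

MIN_DESCRIPTION_CHARS = 50  # "should be at least 50 chars long"
MAX_IGNORED_WORD_CHARS = 3  # "ignores words 3 chars or less"


def meta_description(page_html):
    """Return the META description content, or "" if there is none.

    Simplification: only matches name="description" followed by content=.
    """
    match = re.search(
        r'(?is)<meta\s+name=["\']description["\']\s+content=(["\'])(.*?)\1',
        page_html,
    )
    return match.group(2).strip() if match else ""


def description_long_enough(description):
    return len(description) >= MIN_DESCRIPTION_CHARS


def countable_words(text):
    """Drop short words and anything that looks like a number or price."""
    kept = []
    for token in text.split():
        word = token.strip(".,;:!?\"'()[]")
        if len(word) <= MAX_IGNORED_WORD_CHARS:
            continue
        if any(ch.isdigit() for ch in word):  # assumption: digits = number/price
            continue
        kept.append(word.lower())
    return kept
```

The length cutoff is what throws away short keywords like "seo", which is one of the known bugs listed below.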
Known Bugs
- Mistakes relative internal links for external links.
- Doesn't work with some servers. (Gives me a 404).
- Chokes on JavaScript.
- Some good keywords (like seo, co-op) are being ignored. It's probably better to discard common stop words instead of going by word length.
- Comment fields bug out the script.
- Ignores words in '', which were replacements for WordPress quotes.
- Doesn't stem.
- Doesn't remove stuff like she's, I've, you'll, etc.
- Double and single quotes in WordPress don't get parsed right.
- < in TITLE forces a parse error.
- Ill-formed HTML causes problems.
- <AREA> for image mapping doesn't parse.
- "-" should be replaced by a space. (FIXED)
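The stop word idea from the bug list (discard common stop words instead of going by word length) could look something like this. It's a minimal Python sketch, not part of the tool: `STOP_WORDS` is a tiny made-up list, and cutting contractions at the apostrophe is just one possible way to deal with she's, I've, you'll.

```python
import re

# Tiny illustrative list; a real one would be much longer.
STOP_WORDS = frozenset({
    "a", "an", "and", "are", "for", "i", "in", "is", "it",
    "of", "or", "she", "that", "the", "to", "you",
})


def keywords(text):
    """Keep every word that isn't a stop word, however short it is."""
    # Treat curly apostrophes (the WordPress kind) like plain ones.
    text = text.replace("\u2019", "'").lower()
    tokens = re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", text)
    kept = []
    for token in tokens:
        # "she's" -> "she", "you'll" -> "you"; hyphenated words stay whole.
        base = token.split("'")[0]
        if base not in STOP_WORDS:
            kept.append(base)
    return kept
```

With this approach "seo" and "co-op" survive while "the" and "she's" are dropped. It still doesn't stem.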
Features I may add later
- I want a sitewide analysis, and for this script to tell me the average number of words per page. Low average = bad news. For a sitewide analysis, I can only display stats (not the text). Might be cool to use XML so I can sort the result in any order I please. Besides page by page analysis, I want a site report: average words per page, average links per page, average number of internal incoming links per page, etc.
- Display text snippet that would show in SERPs for a site: search. How would this work? First, I look for an H tag. If I find it, I look for text that follows it.
- Display number of links on a page.
- Display number of unique links on a page.
- Recognize ALT text, NOSCRIPT text, and other types of text.
- Option to ignore short phrases, non-sentences, etc.
- Option to ignore short words/stop words.
- Run this script sitewide to identify THIN affiliate pages.
- Recognize affiliate links (or links with query strings, redirects, etc).
- Add an option to ignore phrases in TABLE elements.
- How do I figure out the number of internal incoming links? By keeping track of links while crawling the site. Say page A links to url1, url2, and url3. Then I add page A to the array for url1, add page A to the array for url2, and so on.
- How do I figure out a list of external links? Take all the links on a page and subtract the ones that point to URLs in its own domain or whose URL doesn't include http: (relative links).
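The last two items (incoming links and external links) can be sketched together. This is a hedged Python illustration with made-up names (`is_external`, `inbound_links`) and placeholder domains (example.com, other.test). It resolves relative links with `urljoin` before comparing hosts instead of just checking for http:, which also avoids mistaking relative internal links for external ones.

```python
from collections import defaultdict
from urllib.parse import urljoin, urlparse


def is_external(page_url, href):
    """True if href, resolved against page_url, points at another host."""
    target = urljoin(page_url, href)
    return urlparse(target).netloc.lower() != urlparse(page_url).netloc.lower()


def inbound_links(outgoing):
    """Map each internal URL to the set of pages that link to it.

    `outgoing` maps a crawled page URL to the list of hrefs found on it.
    """
    inbound = defaultdict(set)
    for page, hrefs in outgoing.items():
        for href in hrefs:
            if not is_external(page, href):
                # Page A links to url1, so A goes into url1's set.
                inbound[urljoin(page, href)].add(page)
    return inbound


# Usage with a two-page crawl of a hypothetical site:
crawl = {
    "http://example.com/a": ["/b", "c.html", "http://other.test/x"],
    "http://example.com/b": ["/a", "c.html"],
}
links_in = inbound_links(crawl)
# len(links_in[url]) is the number of internal incoming links for url.
```

Note that this compares hosts exactly, so www.example.com and example.com would count as different sites.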
Coding Notes
This is a straightforward file_get_contents / preg_match regexp page parser. It's my first time using preg_replace, though. The code is based on an example script from the php.net preg_replace page.
Copyright SEO4Fun.com All Rights Reserved.