Scraping 101: Extracting Anchor Text with Regexp

There are many ways to skin a cat, but when it comes to scraping websites, I like parsing content with regexp. One of the biggest problems I've bumped into when parsing HTML is matching opening and closing tags.

For example:

(<a [^>]+>)(.*)</a>

OK, let’s try that in English:

  1. (<a [^>]+>) matches the opening tag, e.g. <a href="…">.
  2. (.*) *should* match anchor text (I’ll elaborate on that).
  3. </a> matches the closing A tag.

<a href="http://www.searchengineland.com" rel="notpaid">search engine land</a>

will correctly extract the anchor text “search engine land.” BUT because (.*) is greedy,

<a href="http://www.searchengineland.com" rel="notpaid">search engine land</a> is cool because vanessa fox posts there.</a>

will incorrectly extract:

search engine land</a> is cool because vanessa fox posts there.

as anchor text. Hmm..

So how do you fix this? Instead of .*, use .*? or another non-greedy quantifier like +?, ??, or {m,n}? (I haven’t tested the last three, but I assume they work).

(<a [^>]+>)(.*?)</a> will extract anchor text from web pages.
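
To make the difference concrete, here’s a minimal Perl sketch (Perl being the language that shows up in the comments below) that runs both patterns against the problem markup:

    use strict;
    use warnings;

    my $html = '<a href="http://www.searchengineland.com" rel="notpaid">'
             . 'search engine land</a> is cool because vanessa fox posts there.</a>';

    # Greedy: (.*) grabs as much as it can, so the match runs to the LAST </a>.
    if ($html =~ /(<a [^>]+>)(.*)<\/a>/) {
        print "greedy:     $2\n";    # search engine land</a> is cool because ...
    }

    # Non-greedy: (.*?) grabs as little as it can, stopping at the FIRST </a>.
    if ($html =~ /(<a [^>]+>)(.*?)<\/a>/) {
        print "non-greedy: $2\n";    # search engine land
    }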


6 Responses to “Scraping 101: Extracting Anchor Text with Regexp”

  1. Interesting post, but how do you propose we capture all those HTML pages to see who is linking to whom with what anchor text?

  2. “how do you propose we capture all those HTML pages to see who is linking to whom with what anchor text?”

    1. Using the Yahoo API, pull backlinks to the home page only (a rough sketch of this call follows this comment).
    2. Pull sitewide backlinks.
    3. Run a site: command, and pull the first 1,000 URLs.
    4. For each URL, pull backlinks to that URL.
    5. Hamlet Batista also suggested using keywords to pull even more backlinks.
    6. Rerun, filtering out multiple URLs from the same domain. (Not filtering is useful for finding sites that link sitewide; filtering is useful for discovering a greater number of domains.)

    Obviously, even with all those API runs, this method will only dig up a subset of a site’s backlinks, and you’d have to dig through the result set to weed out noise (links from MySpace search pages, nofollowed blog comment links, links from scraper sites, etc.).

    If a single page links to a domain multiple times with different anchor text, that also raises a problem most tools out there don’t deal with.
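
    A rough Perl sketch of step 1 might look like the following. (The Yahoo Site Explorer inlinkData endpoint, its parameters, and the XML field names here are from memory, and YOUR_APP_ID is a placeholder, so treat this as an outline rather than tested code.)

        use strict;
        use warnings;
        use LWP::UserAgent;
        use URI::Escape;

        my $appid = 'YOUR_APP_ID';    # placeholder -- substitute your own Yahoo API key
        my $ua    = LWP::UserAgent->new;

        # Pull up to 1,000 inlinks for a URL, 100 per request, paging with 'start'.
        sub pull_backlinks {
            my ($url) = @_;
            my @links;
            for (my $start = 1; $start <= 901; $start += 100) {
                my $api = 'http://search.yahooapis.com/SiteExplorerService/V1/inlinkData'
                        . "?appid=$appid&results=100&start=$start"
                        . '&query=' . uri_escape($url);
                my $res = $ua->get($api);
                last unless $res->is_success;
                # Crude scrape of the XML response, in the spirit of the post.
                push @links, $res->content =~ /<Url>(.*?)<\/Url>/g;
            }
            return @links;
        }

        print "$_\n" for pull_backlinks('http://www.example.com/');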

  3. […] Scraping 101: Extracting Anchor Text with Regexp, Half’s SEO Notebook […]

  4. Perl has some nice modules available for connecting to and parsing HTML documents. If you like regexp and haven’t learned Perl yet, you should definitely check it out. It has built-in regexp support that makes parsing stuff like HTML docs quick. Here’s some basic code that extracts nofollow links and their corresponding anchor text, or alt text if it’s an image. Note the code is messy :)

    use strict;
    use warnings;
    use HTML::TreeBuilder;
    use LWP::UserAgent;
    use HTTP::Request;

    my $ua = LWP::UserAgent->new;
    $ua->agent("Mozilla/5.0");
    $ua->timeout(3000000);

    my $req = HTTP::Request->new(GET => "http://url.com");
    my $res = $ua->request($req);

    if ($res->is_success) {
        my $tree = HTML::TreeBuilder->new_from_content($res->content);
        if (defined $tree->look_down('_tag' => 'a')) {
            my @getlinks = $tree->look_down('_tag' => 'a');

            for (my $b = 0; $b < scalar(@getlinks); $b++) {
                next unless $getlinks[$b]->attr('href');
                # only report links marked rel="nofollow"
                if ($getlinks[$b]->attr('rel') && $getlinks[$b]->attr('rel') =~ /nofollow/i) {
                    print "Nofollow-> " . $getlinks[$b]->attr('href') . "\n";

                    if ($getlinks[$b]->as_text) {
                        print "This is a text link\n";
                        print "Anchor text-> " . $getlinks[$b]->as_text . "\n";
                    }

                    my $image = $getlinks[$b]->look_down('_tag' => 'img');
                    if ($image && $image->attr('alt')) {
                        print "This is an image link\n";
                        print "Alt text-> " . $image->attr('alt') . "\n";
                    }
                }
            }
        }
        $tree->delete;    # free the parse tree when done
    }

  5. Interesting, Don. One issue I see with your code, though, is that it probably relies on valid HTML to work. Also, if you have stuff nested inside an anchor, the anchor text may not get parsed correctly. But thanks for posting the code. Though I implement scrapers in Java, if I have some free time I’ll definitely check it out.

  6. The TreeBuilder library is actually pretty mature code and was written assuming the HTML is bad, so it works quite nicely: it handles stuff like nested anchors fine and doesn’t break over things like missing closing tags. Perl coders are usually really good at parsing, since that’s the language’s strongest feature. I doubt it’s perfect, but it seems to handle most HTML docs OK. I just got a Y! API key, looped through the 1,000 results, and pulled anchor/image alt text along with their attributes off all the sites.
