Archive for the 'Regexp' Category

Scraping 101: Extracting Anchor Text with Regexp • Friday, February 8th, 2008

There are many ways to skin a cat, but when it comes to scraping websites, I like parsing content with regexp. One of the biggest problems I've bumped into when parsing HTML is matching opening and closing tags.
For example:
(<a [^>]+>)(.*)</a>
OK, let’s try that in English:

(<a [^>]+>) matches <a href="…">.
(.*) *should* match anchor text (I’ll […]
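
To see what the two groups actually capture, here is a minimal sketch in Python. The language, the helper name extract_first_link, and the sample HTML strings are assumptions made for illustration; only the pattern itself comes from the post. It hints at why the anchor text only *should* match: `.*` is greedy, so with more than one link on a line the second group runs all the way to the last </a>.

```python
import re

# The pattern from the post, unchanged.
ANCHOR_PATTERN = re.compile(r'(<a [^>]+>)(.*)</a>')


def extract_first_link(html):
    """Return (opening_tag, anchor_text) for the first match, or None.

    Hypothetical helper for illustration, not part of the original post.
    """
    match = ANCHOR_PATTERN.search(html)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Made-up sample markup: one link, then two links on the same line.
one_link = '<a href="/one">First</a>'
two_links = '<a href="/one">First</a> and <a href="/two">Second</a>'

print(extract_first_link(one_link))
# ('<a href="/one">', 'First')

print(extract_first_link(two_links))
# ('<a href="/one">', 'First</a> and <a href="/two">Second')
```

With a single link the pattern behaves as described. With two links on one line, the greedy `(.*)` captures everything up to the final </a>, including the first closing tag and the second opening tag.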