There are many ways to skin a cat, but when it comes to scraping websites, I like parsing content with regexp. One of the biggest problems I bumped into when parsing HTML is matching opening and closing tags.
For example:
(<a [^>]+>)(.*)</a>
OK, let’s try that in English:
(<a [^>]+>) matches the opening tag, e.g. <a href="…">.
(.*) *should* match anchor text (I’ll […]
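To see what the pattern above actually captures, here is a minimal Python sketch. The sample markup and variable names are made up for illustration, and the lazy variant at the end is my own addition for comparison, not necessarily the fix the post goes on to describe.

```python
import re

# The pattern from the post, unchanged.
pattern = re.compile(r'(<a [^>]+>)(.*)</a>')

# Hypothetical sample markup with two links on one line (an assumption
# made for this sketch, not taken from any real page).
html = '<a href="/one">First</a> and <a href="/two">Second</a>'

match = pattern.search(html)
opening_tag = match.group(1)   # the first <a ...> tag
anchor_text = match.group(2)   # what (.*) really captured

print(opening_tag)
print(anchor_text)

# For comparison only: the same pattern with a lazy quantifier (.*?),
# which stops at the first </a> instead of the last one.
lazy_pattern = re.compile(r'(<a [^>]+>)(.*?)</a>')
lazy_text = lazy_pattern.search(html).group(2)
print(lazy_text)
```

With two links on the same line, the greedy (.*) runs all the way to the last </a>, so the "anchor text" swallows the first closing tag and the whole second link. That is the opening/closing tag mismatch in a nutshell.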