I'd like to implement a generic module for scraping HTML pages. The idea is that when extracting the important bits from a web page (the results of a search engine query, for example, or just about any list of links), one often extracts the same bits from multiple items. In those cases it should be possible to extract just the repeating bits and get everything one is looking for, in a much simpler and more regular structure, with the same extraction algorithm for all input pages.
I've implemented HTML::ListScraper (named by analogy with Text::Scraper and with HTML::Parser, which my module extends), and it works as designed, but not as well as I'd like. HTML::ListScraper looks for repeating tag sequences rather than just for trees, because the module should handle tag soup, too. The implementation is reasonably obvious:
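For illustration only, here is a minimal sketch of what "looking for repeating tag sequences" can mean. It is written in Python rather than Perl and is not HTML::ListScraper's actual algorithm; the names `TagCollector`, `tag_sequence` and `best_repeat` are made up for this example. It flattens a page into a list of opening and closing tags (so unbalanced tag soup is fine) and then searches by brute force for the back-to-back repetition that covers the most tags.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record opening and closing tags in document order, ignoring text."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


def tag_sequence(html):
    """Flatten HTML (well-formed or not) into a list like ['li', 'a', '/a', '/li']."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def best_repeat(tags, min_len=2):
    """Return (start, unit_length, count) for the run of identical,
    back-to-back tag sequences that covers the most tags, or None."""
    best = None
    n = len(tags)
    for length in range(min_len, n // 2 + 1):
        for start in range(n - 2 * length + 1):
            unit = tags[start:start + length]
            count = 1
            pos = start + length
            # Count how many exact copies of `unit` follow immediately.
            while tags[pos:pos + length] == unit:
                count += 1
                pos += length
            covered = length * count
            if count > 1 and (best is None or covered > best[1] * best[2]):
                best = (start, length, count)
    return best


page = ("<ul><li><a href='1'>a</a></li><li><a href='2'>b</a></li>"
        "<li><a href='3'>c</a></li></ul>")
tags = tag_sequence(page)
print(best_repeat(tags))
```

The search is roughly cubic in the number of tags, which is acceptable for a sketch but not for large pages. It also requires the repeats to be exact copies.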
But to be recognized as repeats, all these sequences have to be exactly the same, and in practice that often doesn't happen. Text content can contain different tags: a bolded word here, a paragraph there. Of course one can ignore such "inline" tags, but are they the same for all HTML::ListScraper users? Worse, some parts of the tag sequence can be optional. Say I'm scraping Google results: most have the "Cached" and "Similar pages" links, some don't. For a specific site, one can obviously construct specific queries, but that's exactly what I wanted to avoid... Could my module tell the calling application that "there's a sequence there, but these parts are optional"? How would it find such an amorphous structure? And even if it did, wouldn't it be just too complicated to use?
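One possible direction, sketched under assumptions rather than proposed as the answer: let the caller supply the set of tags to treat as inline, and align the items against each other to separate a shared core from optional parts. The example below is in Python, not Perl, and nothing in it is part of HTML::ListScraper; `DEFAULT_INLINE`, `strip_inline`, `common_core` and `optional_parts` are hypothetical names. It also assumes the page has already been split into per-item tag lists, which is the hard part. Alignment uses Python's `difflib.SequenceMatcher`, which finds a common subsequence but not necessarily the longest one.

```python
from difflib import SequenceMatcher

# Assumption: a default that callers can override, since "inline" is a judgment call.
DEFAULT_INLINE = frozenset({"b", "i", "em", "strong", "span", "font", "br"})


def strip_inline(tags, inline=DEFAULT_INLINE):
    """Drop opening and closing forms of tags the caller considers inline."""
    return [t for t in tags if t.lstrip("/") not in inline]


def common_core(items):
    """Fold all items into one tag subsequence that every item contains."""
    core = list(items[0])
    for item in items[1:]:
        matcher = SequenceMatcher(None, core, item, autojunk=False)
        shared = []
        for block in matcher.get_matching_blocks():
            shared.extend(core[block.a:block.a + block.size])
        core = shared
    return core


def optional_parts(item, core):
    """Return the tags of `item` that are not part of the shared core."""
    matcher = SequenceMatcher(None, core, item, autojunk=False)
    extra = []
    for op, _, _, j1, j2 in matcher.get_opcodes():
        if op in ("insert", "replace"):
            extra.extend(item[j1:j2])
    return extra


# Two made-up result items: the first has a bolded word and an extra link.
full = strip_inline(["li", "a", "/a", "b", "/b", "a", "/a", "/li"])
short = strip_inline(["li", "a", "/a", "/li"])
core = common_core([full, short])
print(core, optional_parts(full, core), optional_parts(short, core))
```

A caller would then get the core sequence plus, per item, the pieces that went beyond it. Whether that is simple enough to use in practice is exactly the open question.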
So, I've decided to release HTML::ListScraper early and often and solicit some feedback: Do you think it's practically usable as it stands? Does it fail for you in interesting ways? Where would you take it, if you had an urge to take it somewhere?
Re: RFC: HTML::ListScraper
by rinceWind (Monsignor) on Apr 25, 2007 at 08:22 UTC
by vbar (Novice) on May 27, 2007 at 20:22 UTC
by Anonymous Monk on Jun 22, 2007 at 07:23 UTC