I'd like to implement a generic module for scraping HTML pages. The idea is that when extracting the important bits from a web page - the results of a search engine query, for example, or just about any list of links - one often extracts the same bits from multiple items. In these cases, it should be possible to extract just the repeating bits and get everything one is looking for in a simpler, more regular structure - and the extraction algorithm is the same for all input pages.

I've implemented HTML::ListScraper (named by analogy with Text::Scraper and with HTML::Parser, which my module extends). It works as designed, but not as well as I'd like. HTML::ListScraper looks for repeating tag sequences rather than repeating subtrees, because the module should handle tag soup, too. The implementation is reasonably obvious:

1. Construct the tag sequence for the whole document.
2. Scan it to find all tag pairs and where they occur.
3. Throw out those that occur only once.
4. Extend the remaining sequences by their adjacent tags.
5. Repeat the previous 2 steps until there are no sequences to extend.
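
The steps above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: it assumes the tag sequence already exists as a plain list of tag names (closing tags written with a leading slash), it extends sequences to the right only, and it counts overlapping occurrences. The name find_repeats is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: map each repeated tag sequence (tags joined by
# spaces) to the list of start offsets where it occurs.
sub find_repeats {
    my @tags = @_;
    my %found;      # every repeated sequence seen so far

    # Step 2: index every adjacent tag pair by its start position.
    my %current;
    for my $i (0 .. $#tags - 1) {
        push @{ $current{"$tags[$i] $tags[$i + 1]"} }, $i;
    }

    my $len = 2;    # length of the sequences held in %current
    while (%current) {
        # Step 3: drop sequences that occur only once.
        for my $seq (keys %current) {
            delete $current{$seq} if @{ $current{$seq} } < 2;
        }
        %found = (%found, %current);

        # Step 4: extend each survivor by the tag that follows it.
        my %next;
        for my $seq (keys %current) {
            for my $pos (@{ $current{$seq} }) {
                my $end = $pos + $len;
                next if $end > $#tags;
                push @{ $next{"$seq $tags[$end]"} }, $pos;
            }
        }

        # Step 5: repeat with the longer sequences.
        %current = %next;
        $len++;
    }
    return \%found;
}

my @tags    = qw(ul li a /a /li li a /a /li /ul);
my $repeats = find_repeats(@tags);

# Pick the repeat with the most tags (count the separating spaces).
my ($longest) = sort { ($b =~ tr/ //) <=> ($a =~ tr/ //) } keys %$repeats;
print "$longest at @{ $repeats->{$longest} }\n";   # prints "li a /a /li at 1 5"
```

The loop ends because sequences grow by one tag per pass and a sequence as long as the whole document can occur only once.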

But to be recognized as repeats, all these sequences have to be exactly the same. In practice, that often doesn't happen. Text content can have different tags - a bolded word here, a paragraph there. Of course one can ignore such "inline" tags, but are they the same for all HTML::ListScraper users? Worse, some parts of the tag sequence can be optional. Say I'm scraping Google results: most have the "Cached" and "Similar pages" links, some don't. For a specific site, obviously one can construct specific queries - but that's exactly what I wanted to avoid... Could my module tell the calling application that "there's a sequence there but these parts are optional"? How would it find such an amorphous structure - and even if it did, wouldn't it be just too complicated to use?
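
To make the "inline" tag point concrete, here is a small sketch of filtering the tag sequence through a caller-supplied ignore set before looking for repeats. The helper name strip_ignored and its default set are assumptions for illustration only, not part of HTML::ListScraper, and the sketch does nothing about optional parts.

```perl
use strict;
use warnings;

# Hypothetical helper: drop tags the caller considers "inline" noise.
# The default ignore set is only an assumption; callers can pass their own.
sub strip_ignored {
    my ($tags, $ignore) = @_;
    $ignore ||= { map { $_ => 1 } qw(b i em strong) };
    return grep {
        (my $name = $_) =~ s{^/}{};    # treat "/b" like "b"
        !$ignore->{$name};
    } @$tags;
}

# Two list items that differ only by a bolded word.
my @item1 = qw(li a /a /li);
my @item2 = qw(li a b /b /a /li);

print join(' ', strip_ignored(\@item1)), "\n";   # prints "li a /a /li"
print join(' ', strip_ignored(\@item2)), "\n";   # prints "li a /a /li"
```

Letting the caller pass the set keeps the question of which tags count as inline out of the module itself.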

So, I've decided to release HTML::ListScraper early and often and solicit some feedback: Do you think it's practically usable as it stands? Does it fail for you in interesting ways? Where would you take it, if you had an urge to take it somewhere?

In reply to RFC: HTML::ListScraper by vbar
