http://qs1969.pair.com?node_id=611933

I'd like to implement a generic module for scraping HTML pages. The idea is that when extracting the important bits from a web page - the results of a search engine query, for example, or just about any list of links - one often extracts the same bits from multiple items. In these cases, it should be possible to extract just the repeating bits and get everything one is looking for in a much simpler, more regular structure - and the extraction algorithm is the same for all input pages.

I've implemented HTML::ListScraper (the name by analogy with Text::Scraper as well as HTML::Parser, which my module extends), and it works as designed, but not as well as I'd like. HTML::ListScraper looks for repeating tag sequences - I don't want to search just for trees, the module should handle tag soup, too. The implementation is reasonably obvious:

  1. Construct the tag sequence for the whole document.
  2. Scan it to find all tag pairs and where they occur.
  3. Throw out those that occur only once.
  4. Extend the remaining sequences by their adjacent tags.
  5. Repeat the previous 2 steps until there are no sequences to extend.
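
To make the steps concrete, here is a minimal sketch in Python. It is not the module's actual Perl code; the function name `repeated_sequences` and the `min_count` parameter are made up for illustration. It assumes the tag sequence is already available as a list of strings, extends sequences only to the right (which is enough to enumerate every repeated run), and lets occurrences overlap.

```python
from collections import defaultdict

def repeated_sequences(tags, min_count=2):
    """Map each tag run seen at least min_count times to its start positions."""
    found = {}
    # Step 2: index every adjacent tag pair by where it starts.
    current = defaultdict(list)
    for i in range(len(tags) - 1):
        current[(tags[i], tags[i + 1])].append(i)
    while current:
        # Step 3: throw out runs that do not repeat.
        repeated = {s: p for s, p in current.items() if len(p) >= min_count}
        found.update(repeated)
        # Step 4: extend each surviving run by the tag that follows it.
        current = defaultdict(list)
        for seq, positions in repeated.items():
            for p in positions:
                nxt = p + len(seq)
                if nxt < len(tags):
                    current[seq + (tags[nxt],)].append(p)
        # Step 5: loop until nothing is left to extend.
    return found

# Hypothetical tag sequence for a two-item list of links.
tags = ["ul", "li", "a", "/a", "/li", "li", "a", "/a", "/li", "/ul"]
found = repeated_sequences(tags)
longest = max(found, key=len)
print(longest, found[longest])
```

For this input the longest repeated run is the `li a /a /li` item, starting at positions 1 and 5.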

But to be recognized as repeats, all these sequences have to be exactly the same, and in practice that often doesn't happen. Text content can contain different tags - a bolded word here, a paragraph there. Of course one can ignore such "inline" tags, but are they the same for all HTML::ListScraper users? Worse, some parts of the tag sequence can be optional. Say I'm scraping Google results: most have the "Cached" and "Similar pages" links, some don't. For a specific site, obviously one can construct specific queries - but that's exactly what I wanted to avoid... Could my module tell the calling application that "there's a sequence there but these parts are optional"? How would it find such an amorphous structure - and even if it did, wouldn't it be just too complicated to use?
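
One possible answer to the "inline" tag question is to let the caller decide. The sketch below (Python with the standard `html.parser`, not HTML::ListScraper's real interface) builds the tag sequence while skipping a caller-supplied set of tags. The names `TagSequence`, `tag_sequence`, `ignore` and the default set are assumptions for illustration only.

```python
from html.parser import HTMLParser

# Assumed default; a caller could pass any other set, or an empty one.
DEFAULT_INLINE = frozenset({"b", "i", "em", "strong", "span", "font", "br"})

class TagSequence(HTMLParser):
    """Collect start and end tags in document order, skipping ignored ones."""

    def __init__(self, ignore=DEFAULT_INLINE):
        super().__init__()
        self.ignore = ignore
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.ignore:
            self.tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.ignore:
            self.tags.append("/" + tag)

def tag_sequence(html, ignore=DEFAULT_INLINE):
    parser = TagSequence(ignore)
    parser.feed(html)
    parser.close()
    return parser.tags

html = '<li><a href="x">one <b>bold</b></a></li><li><a href="y">two</a></li>'
print(tag_sequence(html))                      # the <b> pair is dropped
print(tag_sequence(html, ignore=frozenset()))  # every tag is kept
```

With the inline tags dropped, both list items produce the same `li a /a /li` run, so exact matching finds them. This does nothing for optional parts, which need some form of approximate matching.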

So, I've decided to release HTML::ListScraper early and often and solicit some feedback: Do you think it's practically usable as it stands? Does it fail for you in interesting ways? Where would you take it, if you had an urge to take it somewhere?

Re: RFC: HTML::ListScraper
by rinceWind (Monsignor) on Apr 25, 2007 at 08:22 UTC

    Release early and often is a good approach with CPAN modules. I notice that CPAN testers is showing one failing test (though cpantesters is not yet displaying the results). If I were you, I'd make fixing the failing test a priority for 0.02.

    My first thoughts on the module documentation are that it's not clear when you would want to use it, and what the advantages are over HTML::TokeParser or HTML::TreeBuilder. If this is spelled out loud and clear in the description section, more people will be inclined to install and use your module.

    What would be really good is a worked example. Use some real website that's out there, maybe one you are hosting yourself. Together with a tutorial pod file, this would go a long way to promoting use of your module.

    --
    wetware hacker
    (Qualified NLP Practitioner and Hypnotherapist)

      Making glacial progress... I've fixed the failing tests, and as for when to use HTML::ListScraper, the principal use case is parsing search engine results. But documenting a worked-out example would IMHO be misleading - the module just doesn't work well enough for lots of people to start using it right now...

      HTML::ListScraper is different from HTML::TokeParser and HTML::TreeBuilder in that it doesn't return the same information (for the same input document); it drops the "irregular" parts, leaving something smaller and hopefully easier to interpret - except that as it stands, it drops rather too much...

      Recently I've been reminded that biologists have an interest in sequence matching, and have some interesting algorithms I could try, but they don't seem to be implemented as CPAN modules, so the next step looks like implementing one before trying to incorporate some form of sequence alignment into HTML::ListScraper (a bit like Algorithm::AhoCorasick, which turned out to be completely unnecessary :-) ). And obviously the algorithms will have variations and alternatives I've no idea about - any bioinformatics specialists around here?
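
As a rough illustration of what sequence alignment could do for optional parts, here is a sketch of textbook global alignment (Needleman-Wunsch) applied to two tag sequences. It is written in Python and is not part of HTML::ListScraper. The scoring values and the sample sequences are assumptions, and the optional part is shown as a `span` purely for the example.

```python
def align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two sequences; return (x, y) pairs with None for gaps."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Walk back from the corner to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs

# Hypothetical items: one with an extra element, one without it.
full = ["li", "a", "/a", "span", "/span", "/li"]
short = ["li", "a", "/a", "/li"]
pairs = align(full, short)
optional = [x for x, y in pairs if y is None]
print(optional)
```

Tags aligned against a gap are candidates for "optional" parts of the repeated sequence. Here those are the `span` pair, which is the kind of answer the calling application could be given.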

        Thanks for the module. I had been looking for something similar for a while, although the name did not clearly tell me what the module does. I installed HTML::ListScraper. The documentation mentions the example script scrape, but it does not get installed by cpan install; I had to go back to the distribution to get it. This is just a small inconvenience. When I tried it on my example HTML file, I found that the approximation was splitting it into finer blocks, and I could not figure out a way to tune this parameter. Also, I would have liked approximation to be tried only if exact repetition (something like a suffix tree + longest repeating string combination) fails. Thanks once again. -Sreenivasa