Sly_G has asked for the wisdom of the Perl Monks concerning the following question:

I have to write scripts that parse HTML pages (e-shop product lists, actually) and extract the product names, prices, links to photos, etc. These pages are usually autogenerated, so a page basically consists of repeating chunks like
<table> <tr> <td>........</td> </tr> </table>
that differ slightly inside; those differences are the names, prices, photos, etc. Obviously, one has to change the parsing regexp every time the page design or markup changes. HTML stripping doesn't always help, because I need the image URLs and other HTML information about the products. The best solution would be an algorithm that finds the repeating chunks and returns the differences between them, since those differences are the actual data I'm digging for. For example, if there are 20 products on a page, there are 20 similar chunks of code, and their differences are my data. Maybe there's a CPAN module for this, or something similar. Thanks!

Replies are listed 'Best First'.
Re: Search for repeating but slightly different patterns
by CountZero (Bishop) on Oct 24, 2010 at 17:32 UTC
    Just a wild suggestion:
    • Download a first page.
    • Download a second page.
    • Do a diff between the two pages and you have the content that has changed.

    A module such as Algorithm::Diff could be useful here.

    PS: if this algorithm works, you may use it for free, just call it "CountZero's Universal Scraper Algorithm"!

    Update: Could be as easy as this:

    use strict;
    use warnings;
    use 5.012;
    use LWP::Simple;
    use Text::Diff;

    my $first_content  = get('http://www.thinkgeek.com/geektoys/all/');
    my $second_content = get('http://www.thinkgeek.com/geektoys/feature/desc/1/60');
    my $diff = diff \$first_content, \$second_content;
    say $diff;

    Of course, you will have to parse the diff output!
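
    The same idea can be applied to the repeated chunks within a single page rather than to two whole pages. The following is only a rough sketch using core Perl instead of a diff module: it splits each chunk into tags and text, then keeps the positions where the chunks disagree. The sample markup and the names tokenize and varying_fields are made up for illustration, and the sketch assumes every chunk produces the same number of tokens.

```perl
use strict;
use warnings;

# Split a chunk of HTML into tags and the text between them,
# dropping tokens that are empty or whitespace only.
sub tokenize {
    my ($html) = @_;
    return grep { /\S/ } split /(<[^>]*>)/, $html;
}

# Compare chunks token by token and return, for each chunk, the tokens
# found at positions where the chunks do not all agree.
# Assumption: every chunk yields the same number of tokens.
sub varying_fields {
    my @rows = map { [ tokenize($_) ] } @_;
    my @varying = grep {
        my $i = $_;
        my %seen;
        $seen{ $_->[$i] } = 1 for @rows;
        keys(%seen) > 1;
    } 0 .. $#{ $rows[0] };
    return map { [ @{$_}[@varying] ] } @rows;
}

# Hypothetical product rows as they might repeat on one page.
my @chunks = (
    '<tr><td><img src="/img/mug.jpg"></td><td>Mug</td><td>9.99</td></tr>',
    '<tr><td><img src="/img/pen.jpg"></td><td>Pen</td><td>1.50</td></tr>',
);

print join( ' | ', @$_ ), "\n" for varying_fields(@chunks);
```

    If the chunks differ in length (an optional "sale" badge, say), the positions no longer line up, and a real diff such as the one from Algorithm::Diff would be the better tool.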

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: Search for repeating but slightly different patterns
by jethro (Monsignor) on Oct 24, 2010 at 17:08 UTC
    One thing that might help you is to use modules that can parse the HTML. HTML::Tree::Scanning is an article that explains how to do scanning with HTML::Parser and HTML::TreeBuilder.

    Just a wild idea: An algorithm to find the repeated structures might work, but humans are so much better at that sort of pattern matching. Since complete website redesigns are not that common, you might instead give some designated user the ability to request a listing of the website's HTML structure and to interactively mark the start of the repeating structure and the element holding the product name. That would only work if that user knows a bit of HTML, but it would make updating for a changed website relatively easy and fast.

      Having done this kind of thing a few times, let me suggest that, IMHO, this is a case where a regex is better than parsing. Apart from sites continually changing their formats, the main problem I found was that a lot of websites use broken HTML that won't validate, which makes parsing problematic.

      This is one of those cases where, in theory, parsing is better, but it has to be very forgiving parsing, to the point that I found it unreliable. Disheartening.

        I'm with you on this. If all sites followed a strict HTML standard, the web would be a much better place. So, regexp it is!
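
        For what it's worth, a regex that tolerates sloppy markup might look roughly like the sketch below. The sample HTML, the class name "name" and the field layout are all invented for illustration; a real shop page would need its own pattern.

```perl
use strict;
use warnings;

# Hypothetical, deliberately sloppy markup: unquoted attributes,
# mixed-case tags and missing </td> tags.
my $html = <<'HTML';
<TABLE><tr><td><img src=/img/mug.jpg><td class=name>Coffee Mug<td>$9.99</tr>
<tr><td><IMG SRC="/img/pen.jpg"><td class="name">Pen</td><td>$1.50</tr></table>
HTML

# One pattern per product row: /x allows layout and comments,
# /i ignores tag case, /s lets . cross newlines.
my $row_re = qr{
    <img \s+ [^>]*? src \s* = \s* ["']? (?<image> [^"'\s>]+ )    # image URL
    .*?
    class \s* = \s* ["']? name ["']? [^>]* >                     # name cell
    \s* (?<name> [^<]+? ) \s* <                                  # product name
    .*?
    \$ (?<price> \d+ (?: \.\d\d )? )                             # price
}xis;

my @products;
while ( $html =~ /$row_re/g ) {
    push @products,
        { image => $+{image}, name => $+{name}, price => $+{price} };
}

printf "%-12s %6s  %s\n", @{$_}{qw(name price image)} for @products;
```

        One caveat with this approach: because .*? may cross row boundaries, a row that lacks one of the fields can silently borrow it from the next row, so the results are worth sanity-checking.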
Re: Search for repeating but slightly different patterns
by JavaFan (Canon) on Oct 24, 2010 at 17:06 UTC
    The best solution would be an algorithm that could find repeating chunks and return the differences they have
    Uhm, either chunks repeat, or they are different. They cannot be both.
    HTML stripping technique doesn't always help, because I have to have image urls and other html information on products
    Have you tried parsing the pages instead of using regular expressions?
Re: Search for repeating but slightly different patterns
by planetscape (Chancellor) on Oct 25, 2010 at 06:10 UTC

    There are also modules that allow "approximate" or "fuzzy" matching; some deal with the "similarity" of strings; and many relevant threads may be found under Mostly Regex Stuff.

    HTH,

    planetscape
Re: Search for repeating but slightly different patterns
by aquarium (Curate) on Oct 24, 2010 at 22:43 UTC
    Assuming you (the developer) know ahead of processing which site will produce which variant of presentation, you can have unique code (per site) that scrapes the page and translates it into a stable internal representation; from there on, your code always works with just the internal representation. In effect, this strategy decouples the initial read of the HTML from the rest of the code. The hard part is determining which form of internal representation (data structure) will work for all cases and give you consistent access to the data.
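
    A minimal sketch of that decoupling, assuming a dispatch table keyed by site name: the site names, the markup and the record fields (name, price, image) are hypothetical and chosen only to show the shape of the idea.

```perl
use strict;
use warnings;

# Hypothetical per-site scrapers. Each takes raw HTML and returns a list
# of records in one shared shape: { name => ..., price => ..., image => ... }.
my %scraper_for = (
    'shop-a.example' => sub {
        my ($html) = @_;
        my @items;
        while ( $html =~ m{<td><img src="([^"]+)"></td><td>([^<]+)</td><td>([\d.]+)</td>}g ) {
            push @items, { image => $1, name => $2, price => $3 };
        }
        return @items;
    },
    'shop-b.example' => sub {
        my ($html) = @_;
        my @items;
        while ( $html =~ m{<li data-price="([\d.]+)"><a href="[^"]*">([^<]+)</a><img src="([^"]+)"}g ) {
            push @items, { price => $1, name => $2, image => $3 };
        }
        return @items;
    },
);

# Site-specific code ends here; callers only ever see the common records.
sub scrape {
    my ( $site, $html ) = @_;
    my $scraper = $scraper_for{$site}
        or die "No scraper registered for $site\n";
    return $scraper->($html);
}

my @a = scrape( 'shop-a.example',
    '<tr><td><img src="/i/mug.jpg"></td><td>Mug</td><td>9.99</td></tr>' );
my @b = scrape( 'shop-b.example',
    '<li data-price="1.50"><a href="/pen">Pen</a><img src="/i/pen.jpg"></li>' );

print "$_->{name}: $_->{price} ($_->{image})\n" for @a, @b;
```

    When a shop changes its markup, only its entry in the table needs to be touched; everything that consumes the records stays as it is.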
    the hardest line to type correctly is: stty erase ^H