GermanHerman has asked for the wisdom of the Perl Monks concerning the following question:

I have recently been given the task of mining a virtually unlimited number of webpages for certain bits of information that change from page to page. The good news is that all of the pages on a particular site are created by the same CGI script. The bad news is that these templates differ between sites. I would like to create a program that generates functions specifically designed to pull apart a particular page template, with a little user input (unless we could do without it).

Thus far I have been pursuing a solution that relies on enumerating the differences between two files, along with the immediate context of those differences. My big hangup has been finding those differences AND getting their context. I have been fiddling with diff on the command line with some minor success, and I think that with user input I could even roll out some regexes, but I would really like to do this correctly. So I am calling on anyone with ideas to post to this thread. Thank you in advance.

-Douglas
Afterthought: I guess if diff produces whole lines instead of just the exact text that differs, I could have the user input the text to be extracted and treat the rest of the line as the context. Oh boy....
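
The afterthought above can be sketched roughly as follows. This is only an illustration, not the poster's code: the function name pattern_from_sample, the $width parameter and the sample markup are all hypothetical. It takes a page plus one value the user points at, keeps a few characters on either side as context, and builds a regex that can be tried on other pages from the same template.

```perl
use strict;
use warnings;

# Build a regex from a page and one sample value picked by the user.
# $width (hypothetical parameter) is how many characters of context
# to keep on each side of the sample.
sub pattern_from_sample {
    my ($page, $sample, $width) = @_;

    my $pos = index($page, $sample);
    return if $pos < 0;                 # sample does not occur in the page

    my $start = $pos - $width;
    $start = 0 if $start < 0;

    my $before = substr($page, $start, $pos - $start);
    my $after  = substr($page, $pos + length($sample), $width);

    # \Q...\E quotes the context so markup characters match literally;
    # the non-greedy group captures whatever sits between the contexts.
    return qr/\Q$before\E(.*?)\Q$after\E/s;
}

# Made-up pages that share a template but carry different data.
my $known = '<td class="price">$4.99</td>';
my $other = '<td class="price">$12.50</td>';

my $re = pattern_from_sample($known, '$4.99', 10);
if (defined $re and $other =~ $re) {
    print "extracted: $1\n";
}
```

A fixed character width is the crudest possible notion of context. If the sample sits at the very end of the page the trailing context is empty and the capture will match nothing, so a real tool would want to check for that.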

update (broquaint): dropped the <pre> tags and added formatting

Re: Seeking input on pattern generation.
by educated_foo (Vicar) on Jul 22, 2003 at 14:44 UTC
    Algorithm::Diff would be a good starting point, since it implements the diff(1) algorithm and works on arbitrary lists. Not sure if it does context, but something like
    undef $/; my @diffs = diff([split /\s+/, <PAGE1>], [split /\s+/, <PAGE2>]);
    (untested, of course) might be a good starting point.

    /s
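
A slightly fuller sketch of the suggestion above, assuming Algorithm::Diff is installed from CPAN. It uses the module's sdiff function rather than diff, because sdiff returns one row per aligned token (flagged 'u', 'c', '+' or '-'), which makes the unchanged neighbours of each difference easy to pick up as context. The helper name diffs_with_context and the sample pages are hypothetical.

```perl
use strict;
use warnings;
use Algorithm::Diff qw(sdiff);

# Made-up pages: same template, different data.
my $page1 = '<tr><td class="name"> Widget </td><td class="price"> $4.99 </td></tr>';
my $page2 = '<tr><td class="name"> Gadget </td><td class="price"> $12.50 </td></tr>';

# Return each differing token together with its unchanged neighbours.
sub diffs_with_context {
    my ($text_a, $text_b) = @_;

    # split ' ' splits on whitespace and skips any leading whitespace.
    my @a = split ' ', $text_a;
    my @b = split ' ', $text_b;

    # Each row is [flag, token_from_a, token_from_b].
    my @rows = sdiff(\@a, \@b);

    my @found;
    for my $i (0 .. $#rows) {
        my ($flag, $old, $new) = @{ $rows[$i] };
        next if $flag eq 'u';

        # Use a neighbour as context only if it is identical in both pages.
        my $before = ($i > 0      && $rows[$i - 1][0] eq 'u') ? $rows[$i - 1][1] : '';
        my $after  = ($i < $#rows && $rows[$i + 1][0] eq 'u') ? $rows[$i + 1][1] : '';

        push @found, { before => $before, old => $old, new => $new, after => $after };
    }
    return @found;
}

for my $d (diffs_with_context($page1, $page2)) {
    print "between [$d->{before}] and [$d->{after}]: '$d->{old}' vs '$d->{new}'\n";
}
```

Splitting on whitespace only works here because the sample markup happens to put spaces around the data. Real pages would probably need tokenizing on tag boundaries instead.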

Re: Seeking input on pattern generation.
by waswas-fng (Curate) on Jul 22, 2003 at 15:37 UTC
    As noted above, educated_foo gives a module that lets you diff with context. My question is more about what your real problem is.

    You state that "I would like to create a program that would create functions specifically designed to pull apart a particular page template". What do you mean by pull apart? Are you trying to write some translation machine to convert to a new template file? Trying to standardize the templates down to a core template with diffs? If you are trying to do the latter to minimize your work, you need to be careful, as diffs _are_ context sensitive, and making a change in the master template can (and, if customizations are heavy, will) make the context portion of your diffs invalid. Let me know what the real problem is that you are trying to solve -- there may be a better way.

    -Waswas
      By pull apart I mean extract data from. This program is going to help me monitor pricing on websites that my employer buys products from very frequently. We have a bandwidth shortage, so we need to precache the pricing information from the pages. I am looking at a long list of pages here, and I need to get the information from an entire section of every site. I have spent a few days creating some really general scripts that work with different site configurations, and they work all right, but I am still spending too much time formulating regular expressions. These sites are usually generated entirely from a database, with no custom code whatsoever.

      -Douglas
        If this is a company that you place large volumes of orders with, have you talked to them to see if they will provide a CSV dump of their product list? Scraping data like this off an order site just seems like a huge job with little payoff.

        -Waswas