in reply to RE: A grammar for HTML matching
in thread A grammar for HTML matching

In other words, I can't tell if you're just looking for something faster than HTML::Parser or Parse::RecDescent, or you have a different generalized approach in mind.

Well, both. The idea is that I only care about a small part of the total document, and I don't want to have to examine all the irrelevant parts just to get to the part I'm interested in. The benefits are speed and invariance to document layout: if you know that the book's summary follows <p>Summary, you can ignore the rest of the document. I want to respect document structure within the segment I'm interested in, but disregard the rest.
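
To make that concrete, here's a minimal sketch of what I mean in plain Perl rather than any matcher syntax; the sample HTML and the </body> fallback anchor are made up purely for illustration:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # A made-up sample page; in real use this would be the fetched document.
    my $html = q{
    <html><body>
    <p>Reviews, prices, and other stuff we don't care about.</p>
    <p>Summary A short plot summary of the book goes here.
    <p>Yet more stuff we don't care about.</p>
    </body></html>
    };

    # The segment of interest is everything between "<p>Summary" and the
    # next <p> (or the end of the body); the rest of the page is never
    # examined as structure at all.
    if ($html =~ m{<p>\s*Summary\s*(.*?)(?=<p>|</body>)}is) {
        print "Matched summary: $1\n";
    }

The anchor here is just a literal string plus a stop condition; the matcher grammar would presumably let you say the same thing more declaratively, but the work done per page is the same: one targeted match instead of a full parse.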

HTML::TreeBuilder is a subclass of HTML::Parser, and while this idea could be implemented on top of it, the point is that it doesn't have to be.

The applications I have in mind for this are:

  1. Ad-filtering by stripping selected portions of HTML, for my pet project FilterProxy. As I've developed it, my mechanism for specifying how to find the piece of HTML containing the ad has gone through many revisions and has been pretty convoluted (though I'm evolving toward the syntax in my original message). This "HTML matcher" idea would be perfect for it.
  2. Scripts which extract data from web sites without needing the entire web page. For instance, ShowTimes, an app for the Palm Pilot, downloads movie theatres, times, and plot summaries from several websites (yahoo.com for movies, imdb for summaries). But every time those sites go through minor revisions, the script breaks. I'm sure there are others who have a custom script to grab a specific piece of data from a web page. Wouldn't it be convenient to just specify a "matcher" like the one I've outlined?

RE: RE: RE: A grammar for HTML matching
by dchetlin (Friar) on Nov 01, 2000 at 10:48 UTC
    I guess I'm still not seeing it -- I've never run into an application like that that HTML::Parser didn't work for, and I don't see why your approach would have to worry less about the document layout. I don't see what about your approach makes it less likely that a minor revision to a site would break something.

    It's quite possible I'm just not thinking along the right lines, though. It's obvious from your FilterProxy page that you know what you're doing -- if you have ideas about how this approach could be implemented, I encourage you to do it. Perhaps I'll understand once I see an actual example or some code.

    -dlc, who currently uses tchrist's web proxy, but might try out yours tomorrow...

      I guess I'm still not seeing it -- I've never run into an application like that that HTML::Parser didn't work for

      Oh, HTML::Parser works. It's just painfully slow. I started using HTML::Parser for this application, and the minimum time to traverse just about any document is about 1 second. As the complexity of the matching code grew, it got even slower. HTML::Parser would be a more appropriate solution if it could call start() only for some specified set of tags, or only for tags with an attribute that matches some regex. But now I'm getting away from HTML::Parser and starting to specify the grammar I'm interested in. By comparison, using a regex and then growing it to include the appropriate tags, I can do these matches in 0.01 seconds (sometimes ;).
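
      For what it's worth, here is roughly what the two approaches look like side by side. This is only a sketch: the <div class="...ad..."> filter and the page.html file name are hypothetical placeholders, not FilterProxy's actual rules. The HTML::Parser 3.x event API calls the start handler for every tag in the document, so the handler spends most of its time throwing events away, while the regex jumps straight to the tags of interest.

          use strict;
          use warnings;
          use HTML::Parser ();

          # Event-driven version: start_h fires for *every* tag, even though
          # we only want <div> tags whose class mentions "ad".
          my $p = HTML::Parser->new(
              api_version => 3,
              start_h     => [
                  sub {
                      my ($tagname, $attr) = @_;
                      return unless $tagname eq 'div';
                      return unless ($attr->{class} || '') =~ /\bad\b/;
                      print qq{would strip: <div class="$attr->{class}">\n};
                  },
                  'tagname, attr',
              ],
          );
          $p->parse_file('page.html');

          # Regex version: skip everything that isn't an interesting tag.
          open my $fh, '<', 'page.html' or die "page.html: $!";
          my $html = do { local $/; <$fh> };
          while ($html =~ /<div[^>]*\bclass="[^"]*\bad\b[^"]*"[^>]*>/gi) {
              print "regex hit: $&\n";
          }

      The timings above (1 second vs. 0.01 seconds) are my own measurements, not a general benchmark, but the shape of the difference shows here: the parser version pays a callback for every tag on the page, while the regex version only pays for the tags it actually matches.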

      Another way to look at this is that writing these HTML matching rules is simply much easier and faster than writing an HTML::Parser script. (OK, so maybe some of you uber-hackers can whip up an HTML::Parser script in seconds ;) These matchers are far shorter than the HTML::Parser-based scripts that would implement them, and they don't require knowledge of Perl.

      -bsm, who has looked at tchrist's proxy, and just looked again; it's faster than I had thought (it's HTML::Parser based), but man, it mangles pages. Speed is only tangentially related to this idea, though. I didn't want to debate the merits of HTML::Parser, but rather to see whether this HTML matching idea has merit, even if it's implemented using HTML::Parser.