In other words, I can't tell if you're just looking for something faster than HTML::Parser or Parse::RecDescent, or you have a different generalized approach in mind.

Well, both. The idea is that I only care about a small part of the total document, and I don't want to examine all the irrelevant parts just to get to the part I'm interested in. The benefits are speed and robustness to changes in document layout. If you know the summary for the book follows <p>Summary, you can ignore the rest of the document. I want to respect document structure within the segment I'm interested in, but disregard it everywhere else.
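To make the <p>Summary example concrete, here is a minimal sketch in plain Perl. It is not the matcher syntax from my original message and it deliberately avoids HTML::Parser: it just searches for the anchor string, takes everything up to the next closing tag, and cleans up only that segment. The helper name extract_segment and the sample page are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text between $anchor and the first
# $end_tag that follows it, or nothing if either one is missing.
sub extract_segment {
    my ($html, $anchor, $end_tag) = @_;

    # Jump straight to the anchor; nothing before it is parsed.
    my $start = index($html, $anchor);
    return if $start < 0;
    $start += length $anchor;

    # The segment ends at the first closing tag after the anchor.
    my $end = index($html, $end_tag, $start);
    return if $end < 0;

    my $segment = substr($html, $start, $end - $start);

    # Crude handling of structure inside the segment only:
    # drop inline tags, collapse whitespace, trim the ends.
    $segment =~ s/<[^>]*>//g;
    $segment =~ s/\s+/ /g;
    $segment =~ s/^\s+|\s+$//g;
    return $segment;
}

# Made-up sample page: only the paragraph after the anchor matters.
my $page = <<'HTML';
<html><body>
<table><tr><td>navigation, ads, and other layout</td></tr></table>
<p>Summary<br>A <b>short</b> description
of the book.</p>
<p>Unrelated footer</p>
</body></html>
HTML

my $summary = extract_segment($page, '<p>Summary', '</p>');
print defined $summary ? "$summary\n" : "no summary found\n";
# prints "A short description of the book."
```

The table before the anchor and the footer after it can change freely without breaking the extraction, which is the layout invariance I'm after. A real matcher would need proper handling of nested tags inside the segment; the regex stripping here is only a stand-in.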

HTML::TreeBuilder is a subclass of HTML::Parser, and while this idea could be implemented on top of it, the point is that it doesn't have to be.

The applications I have in mind for this are:

  1. Ad-filtering by stripping selected portions of HTML, for my pet project FilterProxy. As I've developed this, my mechanism for specifying how to find the piece of HTML containing the ad has gone through many revisions and has become pretty convoluted (but I'm evolving toward the syntax in my original message). This "HTML matcher" idea would be perfect.
  2. Scripts which extract data from web sites without using the entire web page. For instance, ShowTimes, an app for the Palm Pilot, downloads movie theatres, times, and plot summaries from several websites (yahoo.com for movies, imdb for summaries). But every time the sites go through minor revisions, the script breaks. I'm sure there are others who have a custom script to grab a specific piece of data from a web page. Wouldn't it be convenient to just specify a "matcher" like I've outlined?

In reply to RE: RE: A grammar for HTML matching by mcelrath
in thread A grammar for HTML matching by mcelrath
