in reply to A grammar for HTML matching

I think your approach is worth pursuing, but I have not seen anything like it until now.

The big difference between your proposal and the existing modules, if I understand you correctly, is that you want to create (e.g., with this "tagblock" syntax) a special-purpose regex-based parser that ignores nearly all of the document and just dives into the parts that match the input specification. This could of course be faster than a full parse of the whole document tree.

The main point here is: what will be the grammar for your input specification syntax? If you could post some proposal for it in a more formal notation (e.g., BNF), it would be interesting to work on this problem.

From a performance point of view, the compiled specification should almost certainly not be one big regex (a lot of alternation etc. will make it slow, and the "Little Engine that couldn't" problem looms, too), but rather a closure that performs the needed logical tests in concert with the generated regexes.
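
To make the closure idea concrete, here is a minimal sketch. It is written in Python rather than Perl purely for illustration (the same shape works with `qr//` and an anonymous sub), and the `compile_spec` function with its three-part specification is a hypothetical stand-in, not the syntax proposed in the original post. It also assumes the tag of interest is not nested inside itself.

```python
import re

def compile_spec(tag, must_contain=None, must_not_contain=None):
    """Compile a (hypothetical) specification into a matcher closure."""
    # One small regex only locates candidate <tag>...</tag> blocks.
    block_re = re.compile(
        r"<%s\b[^>]*>(.*?)</%s\s*>" % (re.escape(tag), re.escape(tag)),
        re.I | re.S,
    )
    # The conditions stay as separate small regexes instead of being
    # folded into one large alternation-heavy pattern.
    want_re = re.compile(must_contain, re.I) if must_contain else None
    veto_re = re.compile(must_not_contain, re.I) if must_not_contain else None

    def matcher(html):
        hits = []
        for m in block_re.finditer(html):
            body = m.group(1)
            # The logical tests are ordinary code, not regex constructs.
            if want_re and not want_re.search(body):
                continue
            if veto_re and veto_re.search(body):
                continue
            hits.append(m.group(0))
        return hits

    return matcher

find_price_cells = compile_spec(
    "td", must_contain=r"\$\d+", must_not_contain=r"sold out"
)
page = "<tr><td>Widget</td><td>$10</td><td>$12 (sold out)</td></tr>"
results = find_price_cells(page)
print(results)
```

Each generated regex stays small, and the "and/not" logic lives in plain code, which is the point of preferring a closure over a single combined pattern.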

Another point in favor of your approach is, of course, that people who would not be able to use regexes at all will be able to do some quite sophisticated regex matching. Maybe this warrants some searching, too: are there any "user friendly" regex specification packages out there?

Christian Lemburg
Brainbench MVP for Perl
http://www.brainbench.com

RE: RE: A grammar for HTML matching
by Anonymous Monk on Nov 02, 2000 at 01:22 UTC
    The main point here is: what will be the grammar for your input specification syntax? If you could post some proposal for it in a more formal notation (e.g., BNF), it would be interesting to work on this problem.

    I'm working on this. Having never written a grammar in BNF, it'll take me a while to put it all together. I'm playing with Parse::RecDescent to do this. I'm having conceptual problems coming up with a syntax that encompasses all the features I want, and I'm trying to find ideas for features I haven't thought of. For instance, say I want to grab everything between two comments. Or say I want to expand the match to include tags that open immediately preceding the match and close immediately following it, e.g. in <b><center>...match...</center></b> I want to suck in the surrounding <center> and <b> tags.

    Ideas anyone? What do you want to match?

      What do you want to match?

      Well, certainly nothing too complicated. The point is probably to express the possible relationships between tags (e.g., contained in, preceding, following), and not the tags themselves. Obviously this is not trivial because of all the nifty exceptions that are allowed in HTML. Maybe it would be good to divide the parsing into a "candidate recognition" phase (purely regex based) and an "HTML parsing" phase, where you would expand the snippet found to canonical HTML syntax.
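
      A minimal sketch of that two-phase split, in Python with only the standard library. The `CANDIDATE` pattern, the `Canonicalizer` class and the small `VOID` set are assumptions for illustration; "canonical" here just means lowercase tag names, quoted attribute values and explicitly closed tags, and implied end tags such as a second <td> closing the first are not handled.

```python
import re
from html import escape
from html.parser import HTMLParser

# Phase 1: a cheap regex that only finds candidate snippets.
CANDIDATE = re.compile(r"<p\b.*?</p\s*>", re.I | re.S)

# Tags that never take a closing tag (deliberately incomplete).
VOID = {"br", "hr", "img", "input", "meta", "link"}

class Canonicalizer(HTMLParser):
    """Phase 2: re-emit a snippet with explicit, well-nested tags."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        rendered = "".join(
            ' %s="%s"' % (name, escape(value or "", quote=True))
            for name, value in attrs
        )
        self.out.append("<%s%s>" % (tag, rendered))
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray close tag: ignore it
        # Close anything still open inside `tag`, then `tag` itself.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append("</%s>" % top)
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

def canonicalize(snippet):
    parser = Canonicalizer()
    parser.feed(snippet)
    parser.close()
    while parser.open_tags:  # close whatever was left open
        parser.out.append("</%s>" % parser.open_tags.pop())
    return "".join(parser.out)

page = "noise <P ALIGN=center>Hello <B>world</P> noise <i>skip"
candidate = CANDIDATE.search(page).group(0)
canonical = canonicalize(candidate)
print(canonical)
```

      Only the snippet found in phase 1 is ever handed to the parser, so the cost of real HTML parsing is paid for a few hundred bytes rather than for the whole page.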

      Christian Lemburg
      Brainbench MVP for Perl
      http://www.brainbench.com

        And then you have to have a language to express the relationships between the pieces once you've parsed it.

        I'm currently interested in some discussions about using XML::XPath for that language to specify the matches. So, you'd parse HTML into something acceptable as an XPath object, then use XPath's language to pick out the items of interest, then wander that back out as your result.
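
        As a rough illustration of that pipeline, here is the same idea in Python's standard library, used as a stand-in for XML::XPath (whose API is not shown here). The `TreeMaker` class and `html_to_tree` function are hypothetical helpers, ElementTree supports only a subset of XPath, and the sketch assumes the input is already well nested, with no void tags such as <br>.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

class TreeMaker(HTMLParser):
    """Feed HTML parser events into an ElementTree builder."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()

    def handle_starttag(self, tag, attrs):
        # Valueless attributes come through as None; store them as "".
        self.builder.start(tag, {name: value or "" for name, value in attrs})

    def handle_endtag(self, tag):
        self.builder.end(tag)

    def handle_data(self, data):
        self.builder.data(data)

def html_to_tree(html):
    maker = TreeMaker()
    maker.feed(html)
    maker.close()
    return maker.builder.close()  # root element of the built tree

page = (
    '<html><body>'
    '<div class="news"><a href="/a">First</a><a href="/b">Second</a></div>'
    '<div class="ads"><a href="/x">Buy</a></div>'
    '</body></html>'
)
tree = html_to_tree(page)

# The XPath expression states the relationship: links directly inside
# the div whose class is "news", wherever that div sits in the page.
links = tree.findall(".//div[@class='news']/a")
hrefs = [a.get("href") for a in links]
texts = [a.text for a in links]
print(hrefs, texts)
```

        The relationship ("an a directly inside a particular div") is written once in the path expression, and the matched elements can then be serialized back out as the result.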

        -- Randal L. Schwartz, Perl hacker