
The main point here is: what will be the grammar for your input specification syntax? If you could post some proposal for it in a more formal notation (e.g., BNF), it would be interesting to work on this problem.

I'm working on this. Having never written a grammar in BNF, it'll take me a while to put it all together. I'm playing with Parse::RecDescent to do this. I'm having conceptual problems coming up with a syntax that encompasses all the features I want, and I'm still looking for ideas for features I haven't thought of. For instance, say I want to grab everything between two comments? Or say I want to expand the match to include tags that open immediately preceding the match and close immediately following it. For example, given <b><center>...match...</center></b>, I want to suck in the surrounding <center> and <b> tags.
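A rough sketch of those two features, written in Python purely for illustration (it is not Parse::RecDescent, and the helper names `between_comments` and `expand_to_wrappers` are made up here). It assumes the match is already known as a pair of string offsets and that "immediately" means "separated by nothing but whitespace".

```python
import re

# Hypothetical helpers for illustration only.
# An opening tag that ends exactly where the current match begins.
OPEN_BEFORE = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^<>]*>\s*\Z")
# A closing tag that starts exactly where the current match ends.
CLOSE_AFTER = re.compile(r"\s*</([A-Za-z][A-Za-z0-9]*)\s*>")


def between_comments(html, start_label, end_label):
    """Return the text between <!-- start_label --> and <!-- end_label -->."""
    pattern = re.compile(
        r"<!--\s*" + re.escape(start_label) + r"\s*-->"
        r"(.*?)"
        r"<!--\s*" + re.escape(end_label) + r"\s*-->",
        re.DOTALL,
    )
    found = pattern.search(html)
    return found.group(1) if found else None


def expand_to_wrappers(html, start, end):
    """Grow html[start:end] while it is tightly wrapped in <x>...</x>."""
    while True:
        before = OPEN_BEFORE.search(html[:start])
        after = CLOSE_AFTER.match(html, end)
        if not before or not after:
            break
        # Only expand when the tag names pair up (case-insensitively).
        if before.group(1).lower() != after.group(1).lower():
            break
        start, end = before.start(), after.end()
    return start, end


page = "<p>x <b><center>match</center></b> y</p>"
s = page.index("match")
s, e = expand_to_wrappers(page, s, s + len("match"))
print(page[s:e])  # <b><center>match</center></b>
```

The expansion stops at the `<p>` because the text "x " sits between it and the `<b>`. A real grammar would still need a way to say whether that kind of expansion is wanted at all.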

Ideas anyone? What do you want to match?

RE: RE: RE: A grammar for HTML matching
by clemburg (Curate) on Nov 02, 2000 at 18:48 UTC

    "What do you want to match?"

    Well, certainly nothing too complicated. The point is probably to express the possible relationships between tags (e.g., contained in, preceding, following), and not the tags themselves. Obviously this is not trivial, because of all the nifty exceptions that HTML allows. Maybe it would be good to divide the parsing into a "candidate recognition" phase (purely regex based) and an "HTML parsing" phase, where you would expand the snippet found to canonical HTML syntax.
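One way the two phases might look, sketched in Python with only the standard library. The names `find_candidates`, `Canonicalizer` and `canonicalize` are hypothetical, and "canonical" here only means lowercased tag and attribute names with double-quoted attribute values. Supplying omitted end tags is left out.

```python
import re
from html import escape
from html.parser import HTMLParser


def find_candidates(html, tag):
    """Phase 1: purely regex-based recognition of <tag ...>...</tag> snippets."""
    name = re.escape(tag)
    pattern = re.compile(
        rf"<{name}\b[^>]*>.*?</{name}\s*>", re.IGNORECASE | re.DOTALL
    )
    return [m.group(0) for m in pattern.finditer(html)]


class Canonicalizer(HTMLParser):
    """Phase 2: re-serialize a snippet in one consistent spelling."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser has already lowercased the tag and attribute names.
        parts = []
        for name, value in attrs:
            quoted = escape(value or "", quote=True)
            parts.append(f' {name}="{quoted}"')
        self.out.append("<" + tag + "".join(parts) + ">")

    def handle_endtag(self, tag):
        self.out.append("</" + tag + ">")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def canonicalize(snippet):
    parser = Canonicalizer()
    parser.feed(snippet)
    parser.close()  # flush any buffered text
    return "".join(parser.out)


page = "<P>See <A HREF=foo.html>Foo</A> and <a href='bar.html'>Bar</a>"
for candidate in find_candidates(page, "a"):
    print(canonicalize(candidate))
```

Keeping phase 1 as a plain regex means the expensive parse only runs on the few snippets that look promising, which seems to be the point of the split.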

    Christian Lemburg
    Brainbench MVP for Perl
    http://www.brainbench.com

      And then you have to have a language to express the relationships between the pieces once you've parsed it.

      I'm currently interested in some discussions about using XML::XPath for that language to specify the matches. So, you'd parse HTML into something acceptable as an XPath object, then use XPath's language to pick out the items of interest, then wander that back out as your result.
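A small sketch of that pipeline, using Python's standard library as a stand-in for XML::XPath. This is an assumption-heavy illustration: `TreeMaker` and `html_to_tree` are made-up names, the `VOID` set is a partial list, the input is assumed to have a single root element, and `xml.etree.ElementTree` only understands a subset of XPath.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Elements that never take an end tag (partial list, for illustration).
VOID = {"br", "hr", "img", "input", "meta", "link"}


class TreeMaker(HTMLParser):
    """Turn forgiving HTML into an ElementTree that path queries can walk."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        self.open = []  # names of elements that are still open

    def handle_starttag(self, tag, attrs):
        self.builder.start(tag, {name: value or "" for name, value in attrs})
        if tag in VOID:
            self.builder.end(tag)
        else:
            self.open.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open:
            return  # stray end tag: ignore it
        # Close anything left open inside the element being closed.
        while self.open:
            top = self.open.pop()
            self.builder.end(top)
            if top == tag:
                break

    def handle_data(self, data):
        if self.open:  # skip text outside the root element
            self.builder.data(data)

    def finish(self):
        self.close()
        while self.open:
            self.builder.end(self.open.pop())
        return self.builder.close()


def html_to_tree(html):
    maker = TreeMaker()
    maker.feed(html)
    return maker.finish()


page = ("<HTML><BODY><P>Intro<BR><B><CENTER>Total: 42</CENTER></B></P>"
        "<P>Other</P></BODY></HTML>")
root = html_to_tree(page)
# Pick out the item of interest with a path expression...
print([e.text for e in root.findall(".//b/center")])  # ['Total: 42']
# ...and relationships such as "parent of" come with the path language.
print(root.find(".//center/..").tag)  # b
```

The attraction is that "contained in", "preceding" and "following" already have spellings in the path language, so no new matching syntax has to be invented for them.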

      -- Randal L. Schwartz, Perl hacker