in reply to RE: RE: A grammar for HTML matching
in thread A grammar for HTML matching

<cite>What do you want to match?</cite>

Nothing too complicated. The point is probably to express the possible relationships between tags (e.g., contained in, preceding, following) rather than the tags themselves. This is not trivial, because of all the exceptions HTML allows, such as optional end tags and unquoted attribute values. Maybe it would be good to split parsing into two phases: a "candidate recognition" phase (purely regex-based) and an "HTML parsing" phase, in which the snippet found is expanded to canonical HTML syntax.
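To make the two-phase idea concrete, here is a rough sketch. It is written in Python with only the standard library, purely so it is easy to run; it is not an existing Perl module. The names `CANDIDATE`, `Canonicalizer`, and `canonical_snippets` are made up for illustration, and "canonical" here only means lowercase tags, quoted attributes, and explicit end tags for the handful of elements listed in the code.

```python
import re
from html import escape
from html.parser import HTMLParser

# Phase 1: purely regex-based candidate recognition (hypothetical pattern).
# It grabs the shortest <ul>...</ul> span, so nested lists would be cut short.
CANDIDATE = re.compile(r"<ul\b.*?</ul\s*>", re.IGNORECASE | re.DOTALL)

VOID = {"br", "hr", "img", "input", "meta", "link"}
# Elements whose end tag is often omitted before a sibling of the same name.
# Only this one simple case of implied end tags is handled.
SELF_CLOSING_SIBLINGS = {"li", "p", "td", "th", "tr", "option"}


class Canonicalizer(HTMLParser):
    """Phase 2: re-emit a snippet with lowercase tags, quoted attributes,
    and explicit end tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.stack = []

    def handle_starttag(self, tag, attrs):
        # A new <li> directly inside an open <li> implies </li> first.
        if tag in SELF_CLOSING_SIBLINGS and self.stack and self.stack[-1] == tag:
            self.out.append("</{}>".format(self.stack.pop()))
        rendered = "".join(
            ' {}="{}"'.format(name, escape(value or "", quote=True))
            for name, value in attrs
        )
        if tag in VOID:
            self.out.append("<{}{} />".format(tag, rendered))
        else:
            self.out.append("<{}{}>".format(tag, rendered))
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag: ignore it
        # Close everything still open inside the element being closed.
        while self.stack:
            top = self.stack.pop()
            self.out.append("</{}>".format(top))
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def close(self):
        super().close()
        # Close whatever the snippet left open.
        while self.stack:
            self.out.append("</{}>".format(self.stack.pop()))


def canonical_snippets(html):
    """Yield each regex candidate, expanded to the canonical form."""
    for match in CANDIDATE.finditer(html):
        parser = Canonicalizer()
        parser.feed(match.group(0))
        parser.close()
        yield "".join(parser.out)
```

Fed something like `<UL CLASS=nav><LI>one<LI>two</UL>` embedded in a larger page, the regex phase isolates the list and the parsing phase fills in the quotes and the missing `</li>` tags. The regex only has to be good enough to find candidates; the exceptions are left to the parser.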

Christian Lemburg
Brainbench MVP for Perl
http://www.brainbench.com

XML::XPath, anyone?
by merlyn (Sage) on Nov 02, 2000 at 19:00 UTC
    And then you have to have a language to express the relationships between the pieces once you've parsed it.

    I'm currently interested in some discussions about using XML::XPath as the language for specifying the matches. So, you'd parse the HTML into something acceptable as an XPath object, use XPath's language to pick out the items of interest, then walk that back out as your result.
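
    As a rough illustration of that pipeline (parse, select by path, walk the result back out), here is a sketch in Python using only the standard library. It is a stand-in, not XML::XPath: `xml.etree.ElementTree` supports only a subset of XPath (no `preceding` or `following` axes, for instance), and `TreeBuilder` and `select` are hypothetical names. The builder also does nothing about implied end tags, so it assumes reasonably well-formed input.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

VOID = {"br", "hr", "img", "input", "meta", "link"}


class TreeBuilder(HTMLParser):
    """Turn HTML into an ElementTree that path expressions can query."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("document")  # synthetic root element
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = ET.SubElement(
            self.stack[-1], tag, {name: value or "" for name, value in attrs}
        )
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        node = self.stack[-1]
        if len(node):
            # Text after a child element belongs to that child's tail.
            last = node[-1]
            last.tail = (last.tail or "") + data
        else:
            node.text = (node.text or "") + data


def select(html, path):
    """Parse html, then return the elements matching the path expression."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root.findall(path)


page = (
    '<div class="menu"><ul>'
    '<li><a href="/a">A</a></li><li><a href="/b">B</a></li>'
    '</ul></div>'
    '<p><a href="/c">C</a></p>'
)
# "Every link contained in the menu div", stated as a path, not a regex.
menu_links = [a.get("href") for a in select(page, ".//div[@class='menu']//a")]
```

    The "contained in" relationship is carried entirely by the path expression, so the matching code never has to look at the markup itself.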

    -- Randal L. Schwartz, Perl hacker