This matches "regular" HTML opening tags. The part that matches the element name may need to be changed slightly, but other than that, it matches the form <ELEMENT ( ATTR ( = VALUE )? )* >:
my $open = qr{ < [a-zA-Z][a-zA-Z0-9]* (?: \s+ \w+ (?: \s* = \s* (?: "[^"]*" | '[^']*' | [^\s>]* ) )? )* \s* > }x;
The closing tags are far simpler:
my $close = qr{ < / \s* [a-zA-Z][a-zA-Z0-9]* \s* > }x;
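As a quick sanity check, here is a minimal sketch that runs both patterns against a few tags. The sample strings and the classify() helper are made up for illustration and are not part of the original post. Note that an XHTML-style <br/> is rejected, which is one of the places where the pattern "may need to be changed slightly".

```perl
use strict;
use warnings;

# The two patterns from the post, repeated so this runs on its own.
my $open = qr{
  < [a-zA-Z][a-zA-Z0-9]*
  (?: \s+ \w+ (?: \s* = \s* (?: "[^"]*" | '[^']*' | [^\s>]* ) )? )*
  \s* >
}x;
my $close = qr{ < / \s* [a-zA-Z][a-zA-Z0-9]* \s* > }x;

# Hypothetical helper: says which pattern matches the whole string.
sub classify {
  my ($tag) = @_;
  return 'open'  if $tag =~ /\A$open\z/;
  return 'close' if $tag =~ /\A$close\z/;
  return 'neither';
}

# Hypothetical sample tags, chosen only for illustration.
my @samples = (
  '<p>',
  '<a href="x.html" target=_top>',
  '<input checked>',
  '</p>',
  '</ td >',
  '<1bad>',
  '<br/>',
);

printf "%-32s %s\n", $_, classify($_) for @samples;
```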
Comments are slightly trickier:
# the following are comments:
#   <!-- ab -- cd -->    <!-- ab -->    <!---->
#   <!-- ab -- cd -- >   <!-- ab -- >   <!---- >
my $comment = qr{
  <!--                 # <!--
  [^-]*                # 0 or more non -'s
  (?:
    (?! -- \s* > )     # that's not --, space, then >
    -                  # a -
    [^-]*              # 0 or more non -'s
  )*                   # 0 or more times
  -- \s* >             # --, space, then >
}x;
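To see how the negative lookahead makes the match stop at the first closing "-->", here is a small sketch. The first_comment() helper and the test strings are assumptions added for illustration only.

```perl
use strict;
use warnings;

# Same comment pattern as in the post, without the inline notes.
my $comment = qr{
  <!--
  [^-]*
  (?: (?! -- \s* > ) - [^-]* )*
  -- \s* >
}x;

# Hypothetical helper: returns the first comment found, or undef.
sub first_comment {
  my ($text) = @_;
  return $text =~ /($comment)/ ? $1 : undef;
}

# Hypothetical inputs, chosen only for illustration.
my @inputs = (
  'a <!-- ab -- cd --> b',
  'x <!----> y',
  '<!-- one --> mid <!-- two -->',
  '<!-- never closed',
);

for my $text (@inputs) {
  my $found = first_comment($text);
  print defined $found ? "matched: $found\n" : "no comment in: $text\n";
}
```

The third input shows that the match ends at the first terminator instead of running on to the last one in the string.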
The DTD tag is more difficult. There are specific classes of DTD tags (see the specs), so right now I don't have a regex to handle them. But combining the other three regexes:
while ($HTML =~ /\G($open|$close|$comment|[^<]+)/g) {
  # do something with $1
}
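One thing worth knowing about this loop is that it ends quietly at the first "<" that none of the patterns accept. The sketch below is one possible way to notice that, assuming the /c modifier is added so pos() survives the final failed match. The tokenize() function and the sample input are hypothetical.

```perl
use strict;
use warnings;

# Patterns from the post, repeated so this runs on its own.
my $open = qr{
  < [a-zA-Z][a-zA-Z0-9]*
  (?: \s+ \w+ (?: \s* = \s* (?: "[^"]*" | '[^']*' | [^\s>]* ) )? )*
  \s* >
}x;
my $close   = qr{ < / \s* [a-zA-Z][a-zA-Z0-9]* \s* > }x;
my $comment = qr{ <!-- [^-]* (?: (?! -- \s* > ) - [^-]* )* -- \s* > }x;

# Hypothetical helper: returns a reference to the token list and the
# offset where scanning stopped (the string length if all was consumed).
sub tokenize {
  my ($html) = @_;
  my @tokens;
  pos($html) = 0;
  # /c keeps pos() after the last failed match so it can be reported.
  while ($html =~ /\G($open|$close|$comment|[^<]+)/gc) {
    push @tokens, $1;
  }
  return (\@tokens, pos($html));
}

my $input = '<p class="x">Hi <!-- note --> there</p>';
my ($tokens, $stopped) = tokenize($input);
print "[$_]\n" for @$tokens;
print $stopped == length($input)
  ? "consumed everything\n"
  : "stopped early at offset $stopped\n";
```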
Now, using this to create a tree structure of an HTML file shouldn't be too complicated, especially if we use a nice trick like:
# requires the (?{...}) structure
use re 'eval';

while ($HTML =~ m{
  \G
  (
    $open    (?{ $STATE = 'open' })    |
    $close   (?{ $STATE = 'close' })   |
    $comment (?{ $STATE = 'comment' }) |
    [^<]+    (?{ $STATE = 'TEXT' })
  )
}xg) {
  # do something with $1 and $STATE
}
And you can modify $open and $close to keep track of the element name by putting capturing parens around the part that matches the name.
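A rough sketch of that idea follows, assuming the name part is factored into its own pattern and the tokens are tried one at a time instead of in a single alternation, so that $1 is always the element name. The $name pattern, the outline() function, and the stack handling are illustrative assumptions and not the author's code. Elements that never get a closing tag, such as <br>, would stay on the stack in this simple version.

```perl
use strict;
use warnings;

# Hypothetical $name pattern, factored out so both tags can capture it.
my $name = qr{ [a-zA-Z][a-zA-Z0-9]* }x;
my $open = qr{
  < ($name)
  (?: \s+ \w+ (?: \s* = \s* (?: "[^"]*" | '[^']*' | [^\s>]* ) )? )*
  \s* >
}x;
my $close   = qr{ < / \s* ($name) \s* > }x;
my $comment = qr{ <!-- [^-]* (?: (?! -- \s* > ) - [^-]* )* -- \s* > }x;

# Hypothetical helper: returns one indented line per opening tag.
sub outline {
  my ($html) = @_;
  my (@stack, @lines);
  pos($html) = 0;
  while (1) {
    if ($html =~ /\G$open/gc) {
      my $el = lc $1;
      push @lines, ('  ' x @stack) . $el;   # indent by current depth
      push @stack, $el;
    }
    elsif ($html =~ /\G$close/gc) {
      my $el = lc $1;
      # Only pop when the close matches the innermost open element.
      pop @stack if @stack && $stack[-1] eq $el;
    }
    elsif ($html =~ /\G$comment/gc) { }     # comments are skipped
    elsif ($html =~ /\G[^<]+/gc)    { }     # so is plain text
    else { last }                           # end of input or stray "<"
  }
  return @lines;
}

print "$_\n" for outline('<UL><li>one</li><!-- x --><li>two</li></UL>');
```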

It's a matter of thoroughness.

japhy -- Perl and Regex Hacker

In reply to Re: Re (tilly) 2: HTML Matching by japhy
in thread HTML Matching by spaz
