in reply to Re (tilly) 2: HTML Matching
in thread HTML Matching
Opening tags:

    my $open = qr{
      <
      [a-zA-Z][a-zA-Z0-9]*
      (?:
        \s+ \w+
        (?:
          \s* = \s*
          (?: "[^"]*" | '[^']*' | [^\s>]* )
        )?
      )*
      \s* >
    }x;

The closing tags are far simpler:

    my $close = qr{ < / \s* [a-zA-Z][a-zA-Z0-9]* \s* > }x;

Comments are slightly trickier:

    # the following are comments:
    #   <!-- ab -- cd -->   <!-- ab -->   <!---->
    #   <!-- ab -- cd -- >  <!-- ab -- >  <!---- >
    my $comment = qr{
      <!--                # <!--
      [^-]*               # 0 or more non -'s
      (?:
        (?! -- \s* > )    # that's not --, space, then >
        -                 # a -
        [^-]*             # 0 or more non -'s
      )*                  # 0 or more times
      -- \s* >            # --, space, then >
    }x;

The DTD tag is more difficult. There are specific classes of DTD tags (see the specs), so right now I don't have a regex to handle them. But combining the other three regexes:

    while ($HTML =~ /\G($open|$close|$comment|[^<]+)/g) {
      # do something with $1
    }

Now, using this to create a tree structure of an HTML file shouldn't be too complicated, especially if we use a nice trick like:

    # requires the (?{...}) structure
    use re 'eval';

    while ($HTML =~ m{
      \G
      (
        $open    (?{ $STATE = 'open' }) |
        $close   (?{ $STATE = 'close' }) |
        $comment (?{ $STATE = 'comment' }) |
        [^<]+    (?{ $STATE = 'TEXT' })
      )
    }xg) {
      # do something with $1 and $STATE
    }

And you can modify $open and $close to keep track of the element name by putting parens in there.
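A rough sketch of how the element-name parens and the \G loop might combine into a tree builder. This is only one way to do it, not something from the post: the sub name parse_tree, the hash layout (name, children), and the rule for stray closing tags are all assumptions made for illustration. It relies on the capture groups inside $open and $close being numbered in the order they appear once the patterns are interpolated.

```perl
use strict;
use warnings;

# Same patterns as in the post, with parens added around the element name.
my $open = qr{
  < ([a-zA-Z][a-zA-Z0-9]*)
  (?: \s+ \w+ (?: \s* = \s* (?: "[^"]*" | '[^']*' | [^\s>]* ) )? )*
  \s* >
}x;
my $close   = qr{ < / \s* ([a-zA-Z][a-zA-Z0-9]*) \s* > }x;
my $comment = qr{ <!-- [^-]* (?: (?! -- \s* > ) - [^-]* )* -- \s* > }x;

# Hypothetical helper: returns a nested hash { name => ..., children => [...] }.
# Text becomes plain strings in the children list; comments are skipped.
sub parse_tree {
    my ($html) = @_;
    my $root  = { name => '#root', children => [] };
    my @stack = ($root);

    # $1 = name from $open, $2 = name from $close, $3 = text run
    while ($html =~ /\G(?:$open|$close|$comment|([^<]+))/g) {
        my ($opened, $closed, $text) = ($1, $2, $3);
        if (defined $opened) {
            my $node = { name => lc $opened, children => [] };
            push @{ $stack[-1]{children} }, $node;
            push @stack, $node;
        }
        elsif (defined $closed) {
            my $name = lc $closed;
            # Assumption: close back to the nearest open element of that
            # name, and ignore a closing tag that matches nothing.
            my ($idx) = grep { $stack[$_]{name} eq $name } reverse 1 .. $#stack;
            splice @stack, $idx if defined $idx;
        }
        elsif (defined $text) {
            push @{ $stack[-1]{children} }, $text;
        }
    }
    return $root;
}

my $tree = parse_tree('<html><body><p class="x">Hi <b>there</b></p></body></html>');
print $tree->{children}[0]{name}, "\n";
```

Two limitations of the sketch: the loop stops silently at the first `<` that none of the three patterns can match, and elements that never get a closing tag (such as `<br>`) stay open until an enclosing element closes.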
Re (tilly) 4: HTML Matching
by tilly (Archbishop) on Nov 19, 2000 at 03:48 UTC