Don't try to parse HTML with regular expressions. You will almost always find a partial solution, but it will break in increasingly esoteric ways.
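To make the failure mode concrete, here is a minimal sketch using only core Perl. The pattern, the helper name `naive_tags`, and the sample strings are all made up for illustration and are not from the thread. The pattern assumes a tag ends at the first `>`, which is already wrong for quoted attribute values and comments.

```perl
use strict;
use warnings;

# A naive pattern: "a tag is everything from '<' to the next '>'".
my $tag_re = qr/<[^>]*>/;

# Hypothetical helper: return every substring the naive pattern takes for a tag.
sub naive_tags {
    my ($html) = @_;
    return $html =~ /($tag_re)/g;
}

# Hypothetical sample inputs.
my @samples = (
    '<b>bold</b>',                      # simple case, works
    '<img alt="a > b" src="x.png">',    # '>' inside an attribute value
    '<!-- <b>not a tag</b> -->',        # markup inside a comment
);

for my $html (@samples) {
    print "input: $html\n";
    print "  tag: $_\n" for naive_tags($html);
}
```

The second sample is cut off at `<img alt="a >`, and the third reports tags that only exist inside a comment. You can patch the pattern for each case, but every patch exposes another one.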
Instead, do something like this:

    use HTML::TokeParser::Simple;

    # time passes...

    my $parser = HTML::TokeParser::Simple->new( string => $original_html );
    while ( my $token = $parser->get_token ) {
        $token->rewrite_tag;
        if ( $token->is_text ) {
            my $t = $token->as_is;
            $t =~ s/"/&quot;/g;    # More generically you could use HTML::Entities
            print $t;
        }
        else {
            print $token->as_is;
        }
    }

and now your problem is solved. Well, unless you have JavaScript in the HTML, in which case you have another world of grief to deal with, though you can start with:

    use HTML::TokeParser::Simple;

    # time passes...

    my $parser = HTML::TokeParser::Simple->new( string => $original_html );
    while ( my $token = $parser->get_token ) {
        $token->rewrite_tag;
        if ( $token->is_text ) {
            my $t = $token->as_is;
            $t =~ s/"/&quot;/g;
            print $t;
        }
        elsif ( $token->is_start_tag("script") ) {
            # Embedded JavaScript, do not mess up!
            print $token->as_is;
            while ( $token = $parser->get_token ) {
                print $token->as_is;
                if ( $token->is_tag("script") and not $token->is_start_tag ) {
                    last;    # done with JavaScript
                }
            }
        }
        else {
            print $token->as_is;
        }
    }