Re^2: Matching HTML Tags

Re^3: Matching HTML Tags by tilly (Archbishop) on May 24, 2005 at 01:51 UTC
Go back to your original problem; it is simpler and saner. Don't try to parse HTML with regular expressions. You will almost always find a partial solution, but it will break in more and more esoteric ways. Instead do something like this:

```perl
use HTML::TokeParser::Simple;

# time passes...

my $parser = HTML::TokeParser::Simple->new( string => $original_html );

while (my $token = $parser->get_token) {
    $token->rewrite_tag;
    if ($token->is_text) {
        my $t = $token->as_is;
        $t =~ s/"/&quot;/g;
        # More generically you could use HTML::Entities::encode_entities
        print $t;
    }
    else {
        print $token->as_is;
    }
}
```

and now your problem is solved. Well, unless you have JavaScript in the HTML, in which case you have another world of grief to deal with, though you can start with:

```perl
use HTML::TokeParser::Simple;

# time passes...

my $parser = HTML::TokeParser::Simple->new( string => $original_html );

while (my $token = $parser->get_token) {
    $token->rewrite_tag;
    if ($token->is_text) {
        my $t = $token->as_is;
        $t =~ s/"/&quot;/g;
        print $t;
    }
    elsif ($token->is_start_tag("script")) {
        # Embedded JavaScript, do not mess up!
        print $token->as_is;
        while ($token = $parser->get_token) {
            print $token->as_is;
            if ($token->is_tag("script") and not $token->is_start_tag) {
                last; # done with JavaScript
            }
        }
    }
    else {
        print $token->as_is;
    }
}
```
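To make the "esoteric ways" concrete, here is a minimal sketch of one such failure, using only core Perl. The sample markup and the variable names are made up for illustration. It assumes the naive rule "a tag is everything from `<` to the next `>`", which falls over as soon as a quoted attribute value contains a `>`:

```perl
use strict;
use warnings;

# Hypothetical input: a '>' inside a quoted attribute value is legal HTML.
my $html = '<a title="a > b" href="/x">say "hi"</a>';

# Naive rule: a tag runs from '<' to the next '>'.
my @naive_tags = $html =~ /(<[^>]*>)/g;

# The first "tag" is cut short at the '>' inside the title attribute.
print "tag: $_\n" for @naive_tags;

# Escape quotes in everything the naive rule considers text.
my $out = join '', map {
    if (/^</) { $_ }                                  # looks like a tag, keep as is
    else      { (my $t = $_) =~ s/"/&quot;/g; $t }    # "text", escape quotes
} split /(<[^>]*>)/, $html;

# The href attribute's quotes have been escaped along with the real text.
print "$out\n";
```

The rest of the opening tag is treated as text, so the quotes around the `href` value get rewritten and the markup is corrupted. A tokenizer that understands quoting does not have this problem.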
Re^3: Matching HTML Tags by cees (Curate) on May 24, 2005 at 04:29 UTC
I had to do the same thing a couple of days ago. Here is my HTML::Parser solution (to complement tilly's HTML::TokeParser version above). It encodes any special characters found in the text portion of an HTML doc.

```perl
use HTML::Parser;
use HTML::Entities;

my $html = '<div align="center">Your "HTML" page goes here</div>';
my $enc  = '';

my $p = HTML::Parser->new(
    unbroken_text => 1,
    # Everything that is not text (tags, comments, ...) is copied through unchanged
    default_h => [ sub { $enc .= join('', @_) }, "text" ],
    # Text gets its special characters encoded
    text_h    => [ sub { $enc .= HTML::Entities::encode_entities($_[0]) }, "text" ],
);
$p->parse($html);
$p->eof;    # flush any text still buffered by unbroken_text

print $enc;
```

Handling JavaScript is left as an exercise for the reader ;)

- Cees
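One possible take on that exercise, offered as a sketch rather than a tested solution: HTML::Parser's argspec can pass an `is_cdata` flag to the text handler, which is true for text between literal tags such as `<script>` and `<style>`. The sample input below is made up, and the approach assumes you want such text passed through untouched:

```perl
use strict;
use warnings;
use HTML::Parser;
use HTML::Entities;

# Hypothetical sample input with an embedded script.
my $html = '<p>Say "hi"</p><script>var s = "a < b";</script>';
my $enc  = '';

my $p = HTML::Parser->new(
    api_version   => 3,
    unbroken_text => 1,
    # Tags, comments and so on are copied through unchanged
    default_h => [ sub { $enc .= $_[0] }, 'text' ],
    text_h    => [
        sub {
            my ($text, $is_cdata) = @_;
            # Leave script/style content alone, encode ordinary text
            $enc .= $is_cdata ? $text : encode_entities($text);
        },
        'text, is_cdata',
    ],
);
$p->parse($html);
$p->eof;    # flush any buffered text

print "$enc\n";
```

The quotes in the paragraph should come out as `&quot;`, while the script body stays as written. Inline event handlers in attributes (`onclick` and friends) never reach the text handler, so they are untouched either way.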