A quick regex to (imperfectly) entity encode some HTML?

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to write a quick regex to (imperfectly) entity encode some HTML. I can't seem to figure out how to express "if a defined symbol is not inside of < > then match." Here is the code I have now:

my $html = "<b>foo & bar</b>";
my %Map = ( '&' => '&amp;', '"' => '&quot;', '<' => '&lt;',
 '>' => '&gt;', "'" => '&#39;' );
my $RE = join '|', keys %Map;

# This of course encodes everything.
#$html=~s!($RE)!$Map{$1}!g; 

# Returns <b&gt;foo &amp; bar&lt;/b&gt;
$html=~s/(?:(?!<\w[^>]*)($RE))/$Map{$1}/g; 

print $html;

Any wisdom on how to make this work? Thanks.


Replies are listed 'Best First'.
Re: A quick regex to (imperfectly) entity encode some HTML? by Ovid (Cardinal) on Mar 08, 2003 at 00:30 UTC
The easy way is to use HTML::Entities:

use HTML::Entities;
encode_entities( $some_var );
# or
decode_entities( $some_var );

See the documentation for extra features.

Cheers,
Ovid

New address of my CGI Course.
Silence is Evil (feel free to copy and distribute widely - note copyright text)
Re: Re: A quick regex to (imperfectly) entity encode some HTML? by Anonymous Monk on Mar 08, 2003 at 00:49 UTC
Thanks for the quick reply. This would be the easy way if the text didn't already have (X)HTML tags in it. I'm looking for "wisdom" on working around existing tags efficiently.
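For what it's worth, one common way to express "not inside of < >" in a single pass is to let the regex consume whole tags first and put them back unchanged, so that the special characters are only ever seen outside a tag. The sketch below assumes that anything of the form `<word ...>` or `</word ...>` is a tag; the helper name `encode_outside_tags` is made up for illustration and does not come from this thread. It is still imperfect: comments, a `>` inside an attribute value, and text such as `a <b and c> d` will fool it, and entities already present in the text get encoded a second time.

```perl
use strict;
use warnings;

my %Map = (
    '&' => '&amp;', '"' => '&quot;', '<' => '&lt;',
    '>' => '&gt;',  "'" => '&#39;',
);

# Build a character class from the map keys; quotemeta guards against
# any key that is special inside a regex.
my $chars = join '', map { quotemeta } keys %Map;

# Hypothetical helper (name is an assumption, not from the thread).
sub encode_outside_tags {
    my ($html) = @_;

    # Alternative 1 swallows a whole start or end tag and is put back as-is.
    # Alternative 2 only gets a chance outside of anything that looks like a tag.
    $html =~ s{(</?\w[^>]*>)|([$chars])}
              {defined $1 ? $1 : $Map{$2}}ge;

    return $html;
}

print encode_outside_tags(q{<b>foo & bar</b> 1 < 2 "quoted"}), "\n";
# prints: <b>foo &amp; bar</b> 1 &lt; 2 &quot;quoted&quot;
```

The original attempt fails because the negative lookahead is only tested at the position of the character being matched, so it can tell that a `<` opens a tag but has no way of knowing that a later `>` closes one. Matching the tag as a unit sidesteps that.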
Re: Re: Re: A quick regex to (imperfectly) entity encode some HTML? by Ovid (Cardinal) on Mar 08, 2003 at 01:42 UTC
I'm not sure why that would be a problem. Can you decode those entities and then turn around and re-encode them? Further, your code would have the same issue.

In your root post, you wrote: "if a defined symbol is not inside of < > then match". I'm not sure exactly what you mean. Do you mean that you don't want to encode anything that's already in a tag? The following untested snippet takes the name of an HTML document as its argument.

use HTML::TokeParser::Simple;
use HTML::Entities;
use File::Copy;

my $new_html  = '';
my $orig_html = shift || die "Usage: $0 some.html";

# Keep a backup, since the original file is overwritten below.
copy( $orig_html, "${orig_html}.bak" )
    or die "Could not copy ($orig_html): $!";

my $parser = HTML::TokeParser::Simple->new($orig_html);
while ( my $token = $parser->get_token ) {
    # Pass tags through untouched; encode everything else.
    if ( $token->is_tag ) {
        $new_html .= $token->as_is;
        next;
    }
    $new_html .= encode_entities( $token->as_is );
}

open OUTPUT, "> $orig_html"
    or die "Cannot open ($orig_html) for writing: $!";
print OUTPUT $new_html;
close OUTPUT;

The above code is untested. Further, if you have `$HTML::Parser::VERSION < 3.25`, this will not parse XHTML correctly.

Cheers,
Ovid
Re: Re: Re: Re: A quick regex to (imperfectly) entity encode some HTML? by Anonymous Monk on Mar 08, 2003 at 06:04 UTC