Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to write a quick regex to (imperfectly) entity encode some HTML. I can't seem to figure out how to express "if a defined symbol is not inside of < > then match." Here is the code I have now:
my $html = "<b>foo & bar</b>";
my %Map = (
    '&' => '&amp;',
    '"' => '&quot;',
    '<' => '&lt;',
    '>' => '&gt;',
    "'" => '&#39;'
);
my $RE = join '|', keys %Map;

# This of course encodes everything.
#$html=~s!($RE)!$Map{$1}!g;

# Returns <b&gt;foo &amp; bar&lt;/b&gt;
$html=~s/(?:(?!<\w[^>]*)($RE))/$Map{$1}/g;
print $html;
Any wisdom to how to make this work? Thanks.

Replies are listed 'Best First'.
Re: A quick regex to (imperfectly) entity encode some HTML?
by Ovid (Cardinal) on Mar 08, 2003 at 00:30 UTC

    The easy way using HTML::Entities:

    use HTML::Entities;
    encode_entities( $some_var );
    # or
    decode_entities( $some_var );

    See the documentation for extra features.
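
As a small illustration of the suggestion above, here is a sketch using the sample string from the root post. It assumes HTML::Entities is installed (it ships with the HTML-Parser distribution, not with core Perl). The variable names are made up for the example. The optional second argument to encode_entities is one of the documented extras: it restricts which characters get encoded.

```perl
use strict;
use warnings;
use HTML::Entities;

my $html = '<b>foo & bar</b>';

# In non-void context encode_entities returns an encoded copy
# and leaves $html untouched. By default it encodes the tags too.
my $all = encode_entities($html);
print "$all\n";         # &lt;b&gt;foo &amp; bar&lt;/b&gt;

# The second argument is the set of characters to encode.
# Here only the ampersand is touched, so the tags survive.
my $amp_only = encode_entities( $html, '&' );
print "$amp_only\n";    # <b>foo &amp; bar</b>
```

Restricting the set to '&' leaves stray < and > in the text unencoded, so it only helps when the text between tags is known not to contain them.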

    Cheers,
    Ovid

    New address of my CGI Course.
    Silence is Evil (feel free to copy and distribute widely - note copyright text)

      Thanks for the quick reply. This would be the easy way if the text didn't already have (X)HTML tags in it. I'm looking for "wisdom" to working around existing tags efficiently.

        I'm not sure why that would be a problem. Can you decode those entities and then turn around and re-encode them? Further, your code would have the same issue.

        In your root post, you wrote: "if a defined symbol is not inside of < > then match". I'm not sure exactly what you mean. Do you mean that you don't want to encode anything that's already in a tag? The following untested snippet takes the name of an html document as its argument.

        use HTML::TokeParser::Simple;
        use HTML::Entities;
        use File::Copy;

        my $new_html  = '';
        my $orig_html = shift || die "Usage: $0 some.html";

        # Keep a backup before overwriting the original file.
        copy( $orig_html, "${orig_html}.bak" )
            or die "Could not copy ($orig_html): $!";

        my $parser = HTML::TokeParser::Simple->new($orig_html);
        while ( my $token = $parser->get_token ) {
            # Pass tags through untouched; encode everything else.
            if ( $token->is_tag ) {
                $new_html .= $token->as_is;
                next;
            }
            $new_html .= encode_entities( $token->as_is );
        }

        open OUTPUT, "> $orig_html"
            or die "Cannot open ($orig_html) for writing: $!";
        print OUTPUT $new_html;
        close OUTPUT;

        The above code is untested. Further, if you have $HTML::Parser::VERSION < 3.25, this will not parse XHTML correctly.
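
For the quick-and-imperfect case in the root post, the same idea (pass tags through, encode everything else) can also be sketched as a single substitution using only core Perl. This is a rough sketch, not a parser. The helper name encode_outside_tags is made up for illustration, and "tag" is assumed to mean a <, an optional /, a word character, and then anything up to the next >.

```perl
use strict;
use warnings;

# Same five characters as the %Map in the root post.
my %Map = (
    '&' => '&amp;',
    '"' => '&quot;',
    '<' => '&lt;',
    '>' => '&gt;',
    "'" => '&#39;',
);

# Hypothetical helper; the name is not from any module.
sub encode_outside_tags {
    my ($html) = @_;

    # Alternative 1 consumes a whole tag, so nothing inside it is examined.
    # Alternative 2 catches a special character anywhere else.
    $html =~ s{ ( < /? \w [^>]* > ) | ( [&<>"'] ) }
              { defined $1 ? $1 : $Map{$2} }gex;

    return $html;
}

print encode_outside_tags('<b>foo & bar</b>'), "\n";    # <b>foo &amp; bar</b>
```

The trick is to match the tag as well rather than trying to look around it: once a tag has been consumed by the first alternative, the regex engine resumes after it. Known limits of this sketch: a > inside an attribute value ends the "tag" early, comments and doctype declarations are not treated as tags, and entities already present in the text (such as &amp;) get encoded a second time.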

        Cheers,
        Ovid
