in reply to Re: A quick regex to (imperfectly) entity encode some HTML?
in thread A quick regex to (imperfectly) entity encode some HTML?

Thanks for the quick reply. This would be the easy way if the text didn't already have (X)HTML tags in it. I'm looking for "wisdom" on working around existing tags efficiently.

Re: Re: Re: A quick regex to (imperfectly) entity encode some HTML?
by Ovid (Cardinal) on Mar 08, 2003 at 01:42 UTC

    I'm not sure why that would be a problem. Can you decode those entities and then turn around and re-encode them? Further, your code would have the same issue.

    In your root post, you wrote: "if a defined symbol is not inside of < > then match". I'm not sure exactly what you mean. Do you mean that you don't want to encode anything that's already in a tag? The following untested snippet takes the name of an html document as its argument.

    use strict;
    use warnings;
    use HTML::TokeParser::Simple;
    use HTML::Entities;
    use File::Copy;

    my $orig_html = shift || die "Usage: $0 some.html";

    # Keep a backup, since the original file is overwritten below.
    copy( $orig_html, "${orig_html}.bak" )
        or die "Could not copy ($orig_html): $!";

    my $new_html = '';
    my $parser   = HTML::TokeParser::Simple->new($orig_html);
    while ( my $token = $parser->get_token ) {
        if ( $token->is_tag ) {
            # Start and end tags pass through untouched.
            $new_html .= $token->as_is;
            next;
        }
        # Everything else (text, but also comments and declarations)
        # gets entity encoded.
        $new_html .= encode_entities( $token->as_is );
    }

    open my $out, '>', $orig_html
        or die "Cannot open ($orig_html) for writing: $!";
    print $out $new_html;
    close $out;

    The above code is untested. Further, if you have $HTML::Parser::VERSION < 3.25, this will not parse XHTML correctly.
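
    For comparison, a regex-only take on the same idea (leave tags alone, encode the text between them) might look like the rough sketch below. The helper name encode_outside_tags and the %entity table are made up for illustration and are not from the thread. It only handles &, <, > and ", and it is imperfect in the ways you'd expect: text that already contains entities gets double encoded unless you decode it first, and a stray < in the text or a > inside an attribute value will confuse the tag pattern.

```perl
use strict;
use warnings;

# Hypothetical lookup table: only the four characters this sketch encodes.
my %entity = ( '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' );

# Hypothetical helper: encode special characters in the text between tags,
# leaving anything that looks like <...> untouched.
sub encode_outside_tags {
    my ($html) = @_;

    # A capturing group in split keeps the delimiters (the tags) in the list.
    my @parts = split /(<[^>]*>)/, $html;

    for my $part (@parts) {
        next if $part =~ /\A<[^>]*>\z/;       # looks like a tag: leave as is
        $part =~ s/([&<>"])/$entity{$1}/g;    # text: encode special characters
    }
    return join '', @parts;
}

print encode_outside_tags('<p class="x">Fish & "chips" > 5</p>'), "\n";
```

    That is fine for a quick one-off on markup you trust, but a real parser is the safer route for anything messier.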

    Cheers,
    Ovid

    New address of my CGI Course.
    Silence is Evil (feel free to copy and distribute widely - note copyright text)

      OK. Thanks for your help.