I'm not sure why that would be a problem. Can you decode those entities and then turn around and re-encode them? Further, your code would have the same issue.
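
To show what I mean by decoding and then re-encoding, here is a minimal sketch. The helper name normalize_entities is made up for illustration, and I'm assuming it is only ever handed plain text chunks, never markup.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities encode_entities);

# Hypothetical helper (not part of any module): decode whatever entities
# are already present, then encode the result, so that existing entities
# are not encoded a second time.
sub normalize_entities {
    my ($text) = @_;
    return encode_entities( decode_entities($text) );
}

# "&amp;" is already encoded; the bare "&", "<" and ">" are not.
print normalize_entities('AT&amp;T & Sons <cheap>'), "\n";
# prints: AT&amp;T &amp; Sons &lt;cheap&gt;
```

Calling encode_entities alone on that string would turn the existing &amp; into &amp;amp;, which is the double-encoding problem.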

In your root post, you wrote: "if a defined symbol is not inside of < > then match". I'm not sure exactly what you mean. Do you mean that you don't want to encode anything that's inside a tag? If so, the following untested snippet takes the name of an HTML document as its argument, copies it to a .bak file, and rewrites the original with the text outside of tags entity-encoded.

use HTML::TokeParser::Simple;
use HTML::Entities;
use File::Copy;

my $new_html  = '';
my $orig_html = shift || die "Usage: $0 some.html";

copy( $orig_html, "${orig_html}.bak") 
    or die "Could not copy ($orig_html): $!";

my $parser = HTML::TokeParser::Simple->new($orig_html);

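while (my $token = $parser->get_token) {
  # Only text tokens get encoded. Tags, comments, declarations and
  # processing instructions pass through untouched.
  unless ($token->is_text) {
    $new_html .= $token->as_is;
    next;
  }
  $new_html .= encode_entities($token->as_is);
}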

open my $out, '>', $orig_html
    or die "Cannot open ($orig_html) for writing: $!";
print {$out} $new_html;
close $out or die "Cannot close ($orig_html): $!";

Again, the above code is untested. Further, if you have $HTML::Parser::VERSION < 3.25, this will not parse XHTML correctly.

Cheers,
Ovid


In reply to Re: Re: Re: A quick regex to (imperfectly) entity encode some HTML? by Ovid
in thread A quick regex to (imperfectly) entity encode some HTML? by Anonymous Monk