
Something like this seems to work with both character (Unicode) strings and legacy ISO-Latin-1 input:

use HTML::Entities;

my $input = "abc < ä > <p> Ö & ü xyz ";  # ISO-Latin-1

my $encoded;

$encoded = encode_entities($input, "\xA0-\x{FFFD}");
print "$encoded\n";

# now upgrade $input to character string (utf8)
# (by appending some unicode characters)
$input .= "\x{5555} \x{8888}";

$encoded = encode_entities($input, "\xA0-\x{FFFD}");
print "$encoded\n";

which would print:

abc < &auml; > <p> &Ouml; & &uuml; xyz 
abc < &auml; > <p> &Ouml; & &uuml; xyz &#x5555; &#x8888;

(...at least for characters up to \x{FFFD})

Hint: this works because HTML::Entities internally just turns the second argument into a character class in a regex substitution:

s/([\xA0-\x{FFFD}])/$char2entity{$1} || num_entity($1)/ge;
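To make the mechanism concrete, here is a minimal core-Perl sketch of that substitution. The three-entry %char2entity table, num_entity() and encode_with_class() are hypothetical stand-ins written for this illustration (they only mimic what the regex above refers to, and are not the module's actual code or full table):

```perl
use strict;
use warnings;

# Hypothetical, deliberately tiny stand-in for the module's entity table.
my %char2entity = (
    "\xE4" => '&auml;',
    "\xD6" => '&Ouml;',
    "\xFC" => '&uuml;',
);

# Fallback for characters without a named entity: hex numeric reference.
sub num_entity {
    my ($char) = @_;
    return sprintf '&#x%X;', ord $char;
}

# The class string is dropped into [...] as-is, so range syntax works in it.
sub encode_with_class {
    my ($string, $class) = @_;
    $string =~ s/([$class])/$char2entity{$1} || num_entity($1)/ge;
    return $string;
}

# Characters outside the class ('<', '>') are left alone.
print encode_with_class("abc < \xE4 > \x{5555}", "\xA0-\x{FFFD}"), "\n";
# prints: abc < &auml; > &#x5555;
```

The hash lookup works the same whether or not the string has been upgraded to utf8 internally, which is why the same call handles both kinds of input.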

Update: the character class could in principle also be extended to cover the supplementary characters (code points above U+FFFF, i.e. the ones UTF-16 encodes with surrogate pairs), which would then be "\xA0-\x{FFFD}\x{10000}-\x{10FFFD}" (IIRC)

Update 2: Note that this encodes Unicode characters as the corresponding HTML entities (named ones where they exist, numeric references otherwise). Whether the browser has the fonts to render those characters is another matter (though these days browsers render quite a lot of Unicode properly, even with the default configuration). Also note that the result is different from simply sending UTF-8 encoded pages to the browser without declaring them as such (which would produce garbage...).

In case you'd rather convert any byte value with the high bit set (80-FF) into its ISO-Latin-1 entity representation (which I think is what you wanted to do originally), you'd have to make sure that Perl always treats your input strings as bytes (i.e. utf8 flag off). That would be a suboptimal solution, IMO, as you'd misrepresent Unicode characters (which are still recognized as such in your input) as sequences of inappropriate characters from the ISO-Latin-1 range...
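To see what that misrepresentation looks like, here is a small sketch using only core modules. encode_high() is a hypothetical helper that emits plain numeric references for everything in 80-FF; it stands in for the entity encoding and is not part of HTML::Entities:

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character, U+00E4 (a-umlaut), and its two-byte UTF-8 encoding.
my $chars = "\x{E4}";
my $bytes = encode('UTF-8', $chars);    # the two bytes C3 A4

# Hypothetical helper: numeric reference for every value in 80-FF.
sub encode_high {
    my ($string) = @_;
    $string =~ s/([\x80-\xFF])/sprintf('&#x%X;', ord $1)/ge;
    return $string;
}

print encode_high($chars), "\n";    # prints: &#xE4;
print encode_high($bytes), "\n";    # prints: &#xC3;&#xA4;
```

The second line is the problem case: the browser would show two unrelated ISO-Latin-1 characters instead of the single a-umlaut.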

Update 3: (last one, promised :) It seems a negated character class (using ^) works as well, e.g.

$encoded = encode_entities($input, "^\x20-\x7E");  # do not encode printable ASCII chars

That way you wouldn't need to worry about what the correct positive set is... (This is undocumented, though, so no guarantees!)
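One thing to watch with the negated form: a class that only spares \x20-\x7E also matches tabs and newlines. The following core-Perl sketch shows the effect; encode_numeric() is a hypothetical stand-in that emits hex numeric references only, so the module itself may format its references differently:

```perl
use strict;
use warnings;

# Hypothetical numeric-only stand-in for encode_entities($string, $class).
sub encode_numeric {
    my ($string, $class) = @_;
    $string =~ s/([$class])/sprintf('&#x%X;', ord $1)/ge;
    return $string;
}

my $text = "a\tb\n\xE4";

# Negating only printable ASCII also catches the tab and the newline.
print encode_numeric($text, "^\x20-\x7E"), "\n";
# prints: a&#x9;b&#xA;&#xE4;

# Listing the whitespace to keep inside the class leaves it untouched.
print encode_numeric($text, "^\n\r\t\x20-\x7E"), "\n";
```

If line breaks in the output matter to you, adding \n, \r and \t to the negated set is probably the simplest way around this.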

In reply to Re: Removing Unsafe Characters by almut
in thread Removing Unsafe Characters by Praethen
