in reply to Removing Unsafe Characters
Something like this seems to work with both character/unicode strings and legacy ISO-Latin-1 input:
    use HTML::Entities;

    my $input = "abc < ä > <p> Ö & ü xyz "; # ISO-Latin-1
    my $encoded;

    $encoded = encode_entities($input, "\xA0-\x{FFFD}");
    print "$encoded\n";

    # now upgrade $input to character string (utf8)
    # (by appending some unicode characters)
    $input .= "\x{5555} \x{8888}";

    $encoded = encode_entities($input, "\xA0-\x{FFFD}");
    print "$encoded\n";
which would print:
    abc < &auml; > <p> &Ouml; & &uuml; xyz
    abc < &auml; > <p> &Ouml; & &uuml; xyz &#x5555; &#x8888;
(...at least for characters up to \x{FFFD})
Hint: this works because, internally, HTML::Entities simply turns this into the regex substitution:
s/([\xA0-\x{FFFD}])/$char2entity{$1} || num_entity($1)/ge;
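To see that substitution in isolation, here is a minimal sketch using core Perl only. The three-entry `%char2entity` table and the `encode_with_class` helper are made up for illustration (the real table in HTML::Entities is much larger), and `num_entity` is written here on the assumption that the fallback is a hexadecimal character reference:

```perl
use strict;
use warnings;

# Hypothetical, heavily trimmed stand-in for the module's lookup table.
my %char2entity = (
    "\xE4" => '&auml;',
    "\xD6" => '&Ouml;',
    "\xFC" => '&uuml;',
);

# Fallback for characters without a named entity (assumed format:
# hexadecimal character reference).
sub num_entity {
    my ($char) = @_;
    return sprintf('&#x%X;', ord($char));
}

# Hypothetical helper: interpolate the caller's "unsafe characters"
# string into a character class and run the substitution on a copy.
sub encode_with_class {
    my ($text, $class) = @_;
    $text =~ s/([$class])/$char2entity{$1} || num_entity($1)/ge;
    return $text;
}

my $class = "\xA0-\x{FFFD}";
print encode_with_class("abc < \xE4 > \x{5555}", $class), "\n";
```

Since the argument is just dropped between the brackets, anything that is valid inside a regex character class should be usable there, which is also why `<`, `>` and `&` stay untouched with this particular range.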
Update: the character class could in principle also be extended to cover the supplementary characters (those beyond the Basic Multilingual Plane, which UTF-16 represents with surrogate pairs), which would then be "\xA0-\x{FFFD}\x{10000}-\x{10FFFD}" (IIRC)
Update 2: Note that this would properly encode unicode characters as the corresponding HTML entities. Whether the browser then has the appropriate fonts to render those characters correctly is another matter (but these days, browsers are able to render quite a lot of unicode characters properly, even with the default configuration). Also note that the result is different from simply sending UTF-8 encoded pages to the browser without declaring them as such (which would produce garbage...).
If you'd rather convert any byte value with the high bit set (80-FF) into its ISO-Latin-1 entity representation (which I think is what you wanted to do originally), you'd have to make sure that Perl always treats your input strings as bytes (i.e. utf8 flag off). That would be a suboptimal solution, IMO, as you'd misrepresent unicode characters (which are still recognized as such in your input) as sequences of inappropriate characters from the ISO-Latin-1 range...
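A small sketch of that misrepresentation, again with core Perl only. `to_numeric_entities` is a hypothetical helper that emits numeric references only (no named-entity table), and `Encode::encode` is used here just to obtain the UTF-8 byte sequence of a character:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: replace every character in the given class
# with a hexadecimal character reference.
sub to_numeric_entities {
    my ($text, $class) = @_;
    $text =~ s/([$class])/sprintf('&#x%X;', ord($1))/ge;
    return $text;
}

my $chars = "\x{5555}";               # one character, U+5555
my $bytes = encode('UTF-8', $chars);  # the same character as three bytes

# Character semantics: one entity for the one character.
print to_numeric_entities($chars, "\x80-\x{FFFD}"), "\n";

# Byte semantics: each UTF-8 byte is treated as a Latin-1 character.
print to_numeric_entities($bytes, "\x80-\xFF"), "\n";
```

The first line comes out as a single reference to U+5555, the second as three unrelated references (one per byte), which a browser would render as Latin-1 junk rather than the intended character.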
Update 3: (last one, promised :) It seems a negated (complemented) character class (using ^) works as well, e.g.
$encoded = encode_entities($input, "^\x20-\x7E"); # do not encode printable ASCII chars
That way you wouldn't need to worry about what the correct positive set is... (This is undocumented, though, so no guarantees!)
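As a rough check of what the negated class does and does not catch, here is a sketch that mimics the substitution quoted earlier instead of calling the module itself. `to_numeric_entities` is a hypothetical helper and emits numeric references only, so named entities such as `&auml;` will not appear in its output:

```perl
use strict;
use warnings;

# Hypothetical helper: interpolate the class string between brackets,
# as the quoted substitution does, and emit hexadecimal references.
sub to_numeric_entities {
    my ($text, $class) = @_;
    $text =~ s/([$class])/sprintf('&#x%X;', ord($1))/ge;
    return $text;
}

# Leading ^ negates the class: everything except printable ASCII matches.
my $negated = "^\x20-\x7E";

print to_numeric_entities("a<b>&\tc \xE4 \x{8888}", $negated), "\n";
```

Two things to watch with this class: control characters such as tab and newline fall outside \x20-\x7E and get encoded too, while `<`, `>` and `&` are printable ASCII and are therefore left alone.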
Re^2: Removing Unsafe Characters
by Praethen (Scribe) on Apr 30, 2009 at 02:17 UTC