in reply to Removing Unsafe Characters

Something like this seems to work with both character/unicode strings and legacy ISO-Latin-1 input:

use HTML::Entities;

my $input = "abc < ä > <p> Ö & ü xyz ";   # ISO-Latin-1
my $encoded;

$encoded = encode_entities($input, "\xA0-\x{FFFD}");
print "$encoded\n";

# now upgrade $input to character string (utf8)
# (by appending some unicode characters)
$input .= "\x{5555} \x{8888}";

$encoded = encode_entities($input, "\xA0-\x{FFFD}");
print "$encoded\n";

which would print:

abc < &auml; > <p> &Ouml; & &uuml; xyz
abc < &auml; > <p> &Ouml; & &uuml; xyz &#x5555; &#x8888;

(...at least for characters up to \x{FFFD})

Hint: this works because, internally, HTML::Entities simply turns this into the regex substitution:

s/([\xA0-\x{FFFD}])/$char2entity{$1} || num_entity($1)/ge;
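For illustration, here's a minimal standalone sketch of the same idea without HTML::Entities (the %char2entity excerpt and the sprintf fallback are just stand-ins for the module's internal lookup table and its num_entity() helper):

my %char2entity = ("\xE4" => '&auml;', "\xD6" => '&Ouml;', "\xFC" => '&uuml;');  # tiny excerpt

my $in = "abc \xE4 \xD6 \xFC \x{5555}";
(my $out = $in) =~ s/([\xA0-\x{FFFD}])/$char2entity{$1} || sprintf('&#x%X;', ord $1)/ge;
print "$out\n";   # prints: abc &auml; &Ouml; &uuml; &#x5555;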

Update: the character class could in principle also be extended to cover the supplementary characters (i.e. those beyond the Basic Multilingual Plane, which UTF-16 represents as surrogate pairs), in which case it would be "\xA0-\x{FFFD}\x{10000}-\x{10FFFD}".
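(Assuming that extended range is right, the call would then simply become:)

$encoded = encode_entities($input, "\xA0-\x{FFFD}\x{10000}-\x{10FFFD}");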

Update 2: Note that this would properly encode unicode characters as the corresponding HTML entities. Whether the browser then has the appropriate fonts to render those characters correctly is another matter (but these days, browsers can render quite a lot of unicode characters properly, even with the default configuration). Also note that the result this achieves is different from simply sending UTF-8 encoded pages to the browser without declaring them as such (which would produce garbage...).
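(In case it helps, declaring the encoding when sending raw UTF-8 might look something like this in a plain CGI script -- just a sketch:)

binmode STDOUT, ':encoding(UTF-8)';                     # encode the output stream as UTF-8
print "Content-Type: text/html; charset=UTF-8\r\n\r\n"; # declare it in the HTTP header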

In case you'd rather convert any byte value with the high bit set (80-FF) into its ISO-Latin-1 entity representation (which I think is what you wanted to do originally), you'd have to make sure that Perl always treats your input strings as bytes (i.e. utf8 flag off). That would be a suboptimal solution, IMO, as you'd misrepresent unicode characters (which are still recognized as such in your input) as sequences of inappropriate characters from the ISO-Latin-1 range...
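For completeness, a sketch of that bytes-only approach (with the caveat just mentioned: anything outside Latin-1 gets mangled):

use Encode qw(encode);

my $bytes   = encode('ISO-8859-1', $input);          # downgrade to Latin-1 bytes (utf8 flag off; lossy!)
my $encoded = encode_entities($bytes, "\x80-\xFF");  # encode only bytes with the high bit set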

Update 3: (last one, promised :)  It seems a negated (complementary) character class (using ^) works as well, e.g.

$encoded = encode_entities($input, "^\x20-\x7E");  # do not encode printable ASCII chars

That way you wouldn't need to worry about what the correct positive set is...  (This is undocumented, though, so no guarantees!)

Re^2: Removing Unsafe Characters
by Praethen (Scribe) on Apr 30, 2009 at 02:17 UTC

    Part 1: I tried $encoded = encode_entities($input, "\xA0-\x{FFFD}"); -- sadly, it didn't work.

    I then tried to investigate the actual encoding used for the files. Maybe if I can figure that out, I can figure out how to properly convert them.

    I don't have File::MMagic, as suggested at How do I determine encoding format of a file ?, but I do have Encode::Guess. I got that running and immediately got an "Unknown encoding" error exactly at the place where I have a garbage character. When running Encode::Guess on the data as a string (instead of an array), I got "No appropriate encodings found!"
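    (For reference, a minimal Encode::Guess call looks something like the following -- the suspect list here is purely illustrative:)

    use Encode::Guess;

    # guess_encoding() returns an Encode object on success,
    # or an error string ("No appropriate encodings found!" etc.) on failure
    my $enc = guess_encoding($data, qw/latin1 utf8/);
    if (ref $enc) { print "guessed: ", $enc->name, "\n" }
    else          { print "guess failed: $enc\n" }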

    I focused on this character; maybe it could give some clues as to my problem. I used the ord() function to try to isolate it. Two characters return junk; their decimal equivalents are 226 and 128. The 226 is valid, but 128 isn't. To top it all off, I'm positive that the user's intended character was a hyphen.
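    (The ord() inspection was basically a byte-by-byte dump, along these lines, with $line standing in for a line of the file:)

    # print each character's ordinal value to eyeball the garbage
    print join(' ', map { ord } split(//, $line)), "\n";   # ... 226 128 ...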

    I feel even more lost than when I started. None of the solutions provided works properly: I either get more junk characters, or I get valid characters that shouldn't be there at all.

    I think I'll give up on this question and instead chase down how to determine what the character encoding is on these files. The problem is that I have 40,000+ files; how many different encodings could there be? (I'm guessing a few.)