in reply to Filtering out bad UTF8 chars
decode already handles bad UTF-8.
$ perl -MEncode -Mcharnames=:full -wE'
    $bad = "\xE9abc";
    say sprintf "U+%04X %s", ord, charnames::viacode(ord)
        for split //, decode("UTF-8", $bad);
'
U+FFFD REPLACEMENT CHARACTER
U+0061 LATIN SMALL LETTER A
U+0062 LATIN SMALL LETTER B
U+0063 LATIN SMALL LETTER C
It doesn't remove bad characters; it replaces each malformed sequence with U+FFFD. You could play with decode's third argument (CHECK), or you could simply strip out the replacement character afterwards.
s/\x{FFFD}//g;
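
For reference, here is a minimal sketch of both routes as a standalone script. The helper names strip_bad_utf8 and is_valid_utf8 are made up for illustration and are not part of Encode; the only CHECK values used are Encode::FB_CROAK (die on malformed input) and Encode::LEAVE_SRC (don't modify the input buffer), both documented in Encode. Treat it as one way to do it, not the only one.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode with the default CHECK (malformed input
# becomes U+FFFD), then strip the replacement characters afterwards.
sub strip_bad_utf8 {
    my ($bytes) = @_;
    (my $text = decode("UTF-8", $bytes)) =~ s/\x{FFFD}//g;
    return $text;
}

# Hypothetical helper: use the third argument (CHECK) to reject bad input
# instead of repairing it. FB_CROAK makes decode die on the first malformed
# sequence; LEAVE_SRC keeps decode from modifying the source buffer.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    return eval { decode("UTF-8", $bytes, $check); 1 } ? 1 : 0;
}

my $bad = "\xE9abc";   # 0xE9 starts a multi-byte sequence that never completes

print "stripped: ", strip_bad_utf8($bad), "\n";
print "valid:    ", (is_valid_utf8($bad) ? "yes" : "no"), "\n";
```

Note that the strip approach also removes any U+FFFD that was legitimately present in the input, so the FB_CROAK route is the safer choice when you would rather reject bad data than silently repair it.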
Re^2: Filtering out bad UTF8 chars
by FreakyGreenLeaky (Sexton) on Oct 13, 2011 at 09:55 UTC
by ikegami (Patriarch) on Oct 13, 2011 at 14:49 UTC