in reply to Filtering out bad UTF8 chars

decode already handles bad UTF-8.

$ perl -MEncode -Mcharnames=:full -wE'
    $bad = "\xE9abc";
    say sprintf "U+%04X %s", ord, charnames::viacode(ord)
        for split //, decode("UTF-8", $bad);
'
U+FFFD REPLACEMENT CHARACTER
U+0061 LATIN SMALL LETTER A
U+0062 LATIN SMALL LETTER B
U+0063 LATIN SMALL LETTER C

It doesn't remove bad characters, but replaces them with U+FFFD. You could play with decode's third argument (CHECK), or simply strip out the replacement character afterwards:

s/\x{FFFD}//g;
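
If stripping afterwards feels clumsy, the third (CHECK) argument can also be a code reference: it is called with the ordinal of each malformed byte, and whatever it returns is spliced into the output. A minimal sketch ($bytes is a placeholder for your input):

use Encode qw(decode);

# Coderef CHECK: called once per malformed byte with its ordinal;
# returning '' drops the bad byte instead of inserting U+FFFD.
my $clean = decode("UTF-8", $bytes, sub { '' });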

Re^2: Filtering out bad UTF8 chars
by FreakyGreenLeaky (Sexton) on Oct 13, 2011 at 09:55 UTC

    Thanks for the reply ikegami - I then get a "Cannot decode string with wide characters at /usr/lib64/perl5/Encode.pm line 174." error, presumably because the text is already decoded and I'm double-decoding (if I understand correctly).
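
    That error fires when decode() is handed a string that already contains characters above 0xFF, i.e. something that was decoded earlier. A rough guard, only a sketch ($input is a placeholder, and utf8::is_utf8 is a heuristic for "already decoded", not a guarantee):

    use Encode qw(decode);

    # Decode only when the scalar still looks like raw octets;
    # utf8::is_utf8 checks Perl's internal UTF8 flag, which avoids
    # the double-decode croak but does not prove the text is correct.
    my $text = utf8::is_utf8($input) ? $input : decode("UTF-8", $input);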

    My problem is I have input from wildly varying sources (websites) with correspondingly wildly varying encodings...

    I think until I can find a way to handle these scenarios without crashing the backend, I'm not going to try to extract what can be extracted; I'll simply skip these damn files.

    Luckily they're in the extreme minority and as much as it irks me to do this, I'm flagging this #TODO for now.
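
    For the skip-the-file route, a small hedged sketch (process() and $octets are placeholders): decode strictly with FB_CROAK and move on when it dies.

    use Encode qw(decode FB_CROAK);

    # FB_CROAK makes decode() die on the first malformed byte, so a
    # failed eval marks the file as one to skip rather than crash on.
    my $text = eval { decode("UTF-8", $octets, FB_CROAK) };
    if (defined $text) {
        process($text);    # hypothetical per-file handler
    }
    else {
        warn "skipping undecodable file: $@";
    }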

      My problem is I have input from wildly varying sources (websites) with correspondingly wildly varying encodings...

      But you asked about bad UTF-8?! Sorry, I don't understand your question at all.