UTF-8 Malformed Char Error -- how to find and remove bad chars

water has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: UTF-8 Malformed Char Error -- how to find and remove bad chars by Roy Johnson (Monsignor) on Jun 23, 2004 at 14:22 UTC
See this thread. (Super Search for `valid utf-8` came up with it.)
We're not really tightening our belts, it just feels that way because we're getting fatter.
Re: UTF-8 Malformed Char Error -- how to find and remove bad chars by water (Deacon) on Jun 23, 2004 at 15:00 UTC
Here's the offending thing (as displayed in a text editor): `The Widget&amp;reg; is the perfect...` What is this? I know `&amp;` is the ampersand, and `&reg;` is the (R) symbol. Is this semi-mangled markup, with the amp 'double encoded'?

I'm trying the `my $octets = encode("utf8", $x, Encode::FB_DEFAULT); $x = decode("utf8", $x, Encode::FB_DEFAULT);` suggestion and the character is still there. Should I be using `use bytes;` as well?

All are welcome to downvote this node and just tell me to "RTFM", but I've been trying to make sense of the docs, and golly, UTF is hard to understand. For me, at least. Mea culpa. I just want to strip this stuff and make it go away from my files.

utf befuddled -- water
Re^2: UTF-8 Malformed Char Error -- how to find and remove bad chars by iburrell (Chaplain) on Jun 23, 2004 at 16:19 UTC
How are you processing the text? How is it getting encoded to UTF-8 characters? Is it really '&amp;reg;' or '&reg;' (and something else is encoding the ampersand to &amp;)? If the former, you should be double decoding it. If the latter, then presumably your XML parser is resolving the entity into the Unicode character. Hopefully, it is marking the Perl string as Unicode.

UTF-8 is pretty easy to understand. It is a way to encode Unicode characters so that they can be processed by tools that handle normal C strings. All ASCII characters keep their single-byte encoding. Characters beyond ASCII are encoded in two or more bytes. If you get a malformed char error, it could mean your string was corrupted. More likely, it isn't a UTF-8 string at all but some 8-bit encoding like Latin-1. The proper solution is to translate the other encoding into UTF-8 and let Perl handle it.
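To make the byte counts concrete, here is a small sketch using the core Encode module. The helper name `utf8_hex` is made up for this illustration; it only prints the UTF-8 bytes of a few characters in hex.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the UTF-8 bytes of a character string
# as space-separated uppercase hex.
sub utf8_hex {
    my ($chars) = @_;
    my $bytes = encode('UTF-8', $chars);
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

print utf8_hex('A'), "\n";          # 41        (ASCII stays one byte)
print utf8_hex("\x{AE}"), "\n";     # C2 AE     ((R) sign, U+00AE, two bytes)
print utf8_hex("\x{20AC}"), "\n";   # E2 82 AC  (euro sign, U+20AC, three bytes)
```

A Latin-1 file stores the (R) sign as the single byte AE, which is why a UTF-8 reader complains about it.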
Re^3: UTF-8 Malformed Char Error -- how to find and remove bad chars by water (Deacon) on Jun 23, 2004 at 16:56 UTC
"How are you processing the text?"
XML::Twig

"How is it getting encoded to UTF-8 characters?"
I don't know, I received the file from elsewhere, and it is one of many, from many sources.

"Is it really '&amp;reg;' or '&reg;' (and something else is encoding the ampersand to &amp;)?"
I think it is the '&reg;', the (R) symbol.

"If the former, you should be double decoding it."
Please explain more what you mean here... thanks!
Re^4: UTF-8 Malformed Char Error -- how to find and remove bad chars by iburrell (Chaplain) on Jun 23, 2004 at 18:18 UTC
Re: UTF-8 Malformed Char Error -- how to find and remove bad chars by graff (Chancellor) on Jun 24, 2004 at 03:40 UTC
To understand this part of the error message, "unexpected continuation byte 0xae, with no preceding start byte", you'd want to read the "UTF-8" portion of section 3.9 (pp. 77-78) of the Unicode Book (find it here at the link labeled "3. Conformance"). I think what it's telling you is that your text data actually uses the single-byte ISO 8859-1 (Latin-1) encoding, where the "(R)" symbol is the byte 0xAE. If it were really Unicode text, the code point would be U+00AE, and because UTF-8 needs at least two bytes for any code point above U+007F, it would be stored as the two-byte sequence 0xC2 0xAE. So the error message is simply saying that the leading 0xC2 byte isn't there.

The problem, then, seems to be that your script assumes it is getting UTF-8 data when in fact this file contains an ISO 8859-1 single-byte character. To get it to scan properly as UTF-8, you need to decode it from 8859-1: `use Encode; ... my $utf8_version = decode("iso-8859-1", $orig_version); ...`

Of course, if you just want to get rid of the nasty little booger and make sure your data is nothing but ASCII (assuming the offending text is in `$_`): `use bytes; tr/\x01-\x7f//cd;`, which deletes NUL and any byte with the hi-bit set.
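For a file of unknown origin, one possible approach is to locate the offending bytes first and then decide how to decode. The sketch below is only an illustration: the helper names `bad_byte_offsets` and `decode_utf8_or_latin1` are invented here, the pattern follows the well-formed byte sequences listed in the Unicode conformance chapter, and falling back to ISO 8859-1 is an assumption about the data rather than something the file can tell you.

```perl
use strict;
use warnings;
use Encode qw(decode);

# One well-formed UTF-8 sequence, written as a pattern over raw bytes.
my $utf8_char = qr/
    [\x00-\x7F]
  | [\xC2-\xDF][\x80-\xBF]
  | \xE0[\xA0-\xBF][\x80-\xBF]
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
  | \xED[\x80-\x9F][\x80-\xBF]
  | \xF0[\x90-\xBF][\x80-\xBF]{2}
  | [\xF1-\xF3][\x80-\xBF]{3}
  | \xF4[\x80-\x8F][\x80-\xBF]{2}
/x;

# Hypothetical helper: offsets of bytes that do not start a
# well-formed UTF-8 sequence at the current scan position.
sub bad_byte_offsets {
    my ($bytes) = @_;
    my @bad;
    pos($bytes) = 0;
    while (pos($bytes) < length $bytes) {
        next if $bytes =~ /\G$utf8_char/gc;   # skip one valid sequence
        push @bad, pos($bytes);               # record the stray byte
        pos($bytes) = pos($bytes) + 1;        # and step over it
    }
    return @bad;
}

# Hypothetical helper: decode as UTF-8 when the bytes are well-formed,
# otherwise ASSUME the input is ISO 8859-1.
sub decode_utf8_or_latin1 {
    my ($bytes) = @_;
    my $copy = $bytes;   # decode() with a CHECK value may modify its argument
    my $chars = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return $chars if defined $chars;
    return decode('iso-8859-1', $bytes);
}

print join(',', bad_byte_offsets("The Widget\xAE is")), "\n";   # 10
```

Either way, the result of `decode_utf8_or_latin1` is a Perl character string, so it still has to be encoded as UTF-8 again on output (for example with a `:encoding(UTF-8)` layer on the output filehandle).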