water has asked for the wisdom of the Perl Monks concerning the following question:

I'm getting some UTF errors parsing a file using XML Twig:
Assertion (parse: Malformed UTF-8 character (unexpected continuation byte 0xae, with no preceding start byte) in lc at ...., <DATA> line 220. ) failed!
How do I go about finding the bad character, and stripping it? Thanks for any advice.

water

Replies are listed 'Best First'.
Re: UTF-8 Malformed Char Error -- how to find and remove bad chars
by Roy Johnson (Monsignor) on Jun 23, 2004 at 14:22 UTC
    See this thread. (Super Search for valid utf-8 came up with it.)

    We're not really tightening our belts, it just feels that way because we're getting fatter.
Re: UTF-8 Malformed Char Error -- how to find and remove bad chars
by water (Deacon) on Jun 23, 2004 at 15:00 UTC
    Here's the offending thing (as displayed in a text editor)
    The Widget&amp;reg is the perfect...
    What is this? I know &amp; is the ampersand, and &reg; is the (R) symbol -- is this semi-mangled markup, with the amp 'double encoded'? I'm trying the
    my $octets = encode("utf8", $x, Encode::FB_DEFAULT);
    $x = decode("utf8", $octets, Encode::FB_DEFAULT);
    suggestion and the character is still there. Should I be using use bytes; as well?

    All are welcome to downvote this node and just tell me to "RTFM", but I've been trying to make sense of the docs, and golly, UTF is hard to understand. For me, at least. Mea culpa.

    I just want to strip this stuff and make it go away from my files.

    utf befuddled --

    water

      How are you processing the text? How is it getting encoded to UTF-8 characters? Is it really '&amp;reg;' or '&reg;' (and something else is encoding the ampersand to &amp;)? If the former, you should be double decoding it. If the latter, then presumably your XML parser is resolving the entity into the Unicode character. Hopefully, it is marking the Perl string as Unicode.

      UTF-8 is pretty easy to understand. It is a way to encode Unicode characters that can be processed by tools that handle normal C strings. All ASCII characters have the same encoding. Larger characters are encoded in two or more bytes. If you get a malformed char error, it could mean your string was corrupted. More likely is that it isn't a UTF-8 string, but some 8-bit encoding like Latin-1. The proper solution is to translate the other encoding into UTF-8 and let Perl handle it.
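      A minimal sketch of those points, using only the core Encode module. The sample strings and the hex_bytes helper are made up for illustration, and "Widget\xAE" merely stands in for whatever is in the real file:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join " ", map { sprintf "%02X", ord($_) } split //, $bytes;
}

# ASCII characters keep their single-byte encoding in UTF-8.
my $ascii_bytes = encode("UTF-8", "A");
print "A      -> ", hex_bytes($ascii_bytes), "\n";

# U+00AE (registered sign) is above 0x7F, so UTF-8 needs two bytes for it.
my $reg_bytes = encode("UTF-8", "\x{AE}");
print "U+00AE -> ", hex_bytes($reg_bytes), "\n";

# The same character stored as Latin-1 is a lone 0xAE byte,
# which is not well-formed UTF-8.
my $latin1_bytes = "Widget\xAE";
my $copy = $latin1_bytes;    # decode() with a CHECK value may modify its argument
my $ok = eval { decode("UTF-8", $copy, Encode::FB_CROAK); 1 };
print "Latin-1 bytes are ", ($ok ? "valid" : "malformed"), " as UTF-8\n";

# Decoding from the encoding the data is really in, then
# re-encoding, gives proper UTF-8.
my $text = decode("iso-8859-1", $latin1_bytes);
my $utf8 = encode("UTF-8", $text);
print "re-encoded -> ", hex_bytes($utf8), "\n";
```

      The eval around decode() is only there so the sketch can report the failure instead of dying.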

        How are you processing the text?

        XML::Twig

        How is it getting encoded to UTF-8 characters?

        I don't know, I received the file from elsewhere, and it is one of many, from many sources.

        Is it really '&amp;reg;' or '&reg;' (and something else is encoding the ampersand to &amp;)?

        I think it is the ®, (R) symbol.

        If the former, you should be double decoding it.

        Please explain more what you mean here... thanks!
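        One possible reading of "double decoding", sketched with a deliberately tiny hand-rolled entity table. %ENTITY and decode_entities_once are hypothetical names, and the sample assumes the entity carries its trailing semicolon; a real script would more likely lean on a module such as HTML::Entities:

```perl
use strict;
use warnings;

# Hypothetical two-entry entity table, for illustration only.
my %ENTITY = (
    amp => "&",
    reg => "\x{AE}",    # U+00AE REGISTERED SIGN
);

# Resolve one level of known entities; unknown ones are left untouched.
sub decode_entities_once {
    my ($text) = @_;
    $text =~ s/&(\w+);/exists $ENTITY{$1} ? $ENTITY{$1} : "&$1;"/ge;
    return $text;
}

my $raw = "The Widget&amp;reg; is the perfect...";

# First pass undoes the extra escaping of the ampersand.
my $first = decode_entities_once($raw);

# Second pass resolves the entity that was hiding underneath.
my $second = decode_entities_once($first);

print "after one pass: $first\n";
printf "after two passes the entity is one character, U+%04X\n",
    ord(substr($second, 10, 1));
```

        The point is only that a double-encoded entity needs two passes: one pass turns &amp;reg; into &reg;, and the next turns that into the character itself.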

Re: UTF-8 Malformed Char Error -- how to find and remove bad chars
by graff (Chancellor) on Jun 24, 2004 at 03:40 UTC
    To understand this part of the error message:
    unexpected continuation byte 0xae, with no preceding start byte
    you'd want to read the "UTF-8" portion of section 3.9 (pp. 77-78) of the Unicode Book (find it here at the link labeled "3. Conformance").

    I think what it's telling you is that you actually have text data that is using the single-byte "ISO 8859-1" (Latin1) encoding, where the "(R)" symbol is expressed as "0xAE".

    Now, if it were really supposed to be Unicode text data, the code point for that would be U+00AE (0x00AE as a 16-bit value). Owing to the way that UTF-8 is designed, it would have to be expressed using two bytes in UTF-8, and the two-byte sequence would actually be "0xC2 0xAE" -- so the error message is simply saying that the initial "0xC2" byte isn't there.
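    The arithmetic behind that two-byte sequence can be sketched as follows. The function name utf8_two_bytes is made up for this example, and it only covers the two-byte range (U+0080 through U+07FF):

```perl
use strict;
use warnings;

# Two-byte UTF-8 layout: 110xxxxx 10xxxxxx
sub utf8_two_bytes {
    my ($cp) = @_;
    die "code point outside the two-byte range\n"
        if $cp < 0x80 || $cp > 0x7FF;
    my $lead = 0xC0 | ($cp >> 6);      # upper bits go into the start byte
    my $cont = 0x80 | ($cp & 0x3F);    # low 6 bits go into the continuation byte
    return ($lead, $cont);
}

my ($lead, $cont) = utf8_two_bytes(0xAE);
printf "U+00AE -> 0x%02X 0x%02X\n", $lead, $cont;    # prints "U+00AE -> 0xC2 0xAE"
```

    Any byte of the form 10xxxxxx (0x80 through 0xBF) looks like a continuation byte, which is why a bare Latin-1 0xAE is reported as a continuation with no start byte in front of it.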

    So the problem would seem to be that your script is assuming that it is getting utf8 data, when in fact this file contains an ISO 8859-1 single-byte character. To get it to scan properly as utf8, you need to "decode" it out of 8859-1:

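    use Encode;
    ...
    # decode() turns the Latin-1 bytes into a Perl character string,
    # which can then be written out as valid utf8
    my $utf8_version = decode("iso-8859-1", $orig_version);
    ...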
    Of course, if you just want to get rid of the nasty little booger, and make sure your data is nothing but ASCII:
    ...
    # assume that offending text is in $_
    use bytes;
    tr/\x01-\x7f//cd;   # delete any byte with hi-bit set (and any NUL bytes)
    ...
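    For the "how do I find the bad character" half of the question, here is a rough sketch that scans a byte string against the well-formed UTF-8 byte patterns from the Unicode Standard's conformance chapter. The names $valid_utf8 and find_bad_bytes are hypothetical, and the sample line is a guess at what the raw bytes in the file look like:

```perl
use strict;
use warnings;

# One well-formed UTF-8 sequence, expressed on raw bytes.
my $valid_utf8 = qr/
      [\x00-\x7F]                          # ASCII
    | [\xC2-\xDF][\x80-\xBF]               # two bytes
    | \xE0[\xA0-\xBF][\x80-\xBF]           # three bytes, no overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # three bytes
    | \xED[\x80-\x9F][\x80-\xBF]           # three bytes, no surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}        # four bytes, no overlongs
    | [\xF1-\xF3][\x80-\xBF]{3}            # four bytes
    | \xF4[\x80-\x8F][\x80-\xBF]{2}        # four bytes, up to U+10FFFF
/x;

# Return a list of [offset, byte value] pairs for bytes that do not
# belong to any well-formed sequence.
sub find_bad_bytes {
    my ($bytes) = @_;
    my @bad;
    pos($bytes) = 0;
    while (pos($bytes) < length($bytes)) {
        next if $bytes =~ /\G$valid_utf8/gc;    # good sequence, keep going
        my $offset = pos($bytes);
        push @bad, [ $offset, ord(substr($bytes, $offset, 1)) ];
        pos($bytes) = $offset + 1;              # skip the bad byte and resync
    }
    return @bad;
}

my $line = "The Widget\xAE is the perfect...";
printf "offset %d: byte 0x%02X\n", @$_ for find_bad_bytes($line);

# Keep only the well-formed sequences, dropping everything else.
my $clean = join "", $line =~ /$valid_utf8/g;
print "$clean\n";
```

    Unlike the tr/// approach, this keeps any valid multi-byte characters and removes only the bytes that cannot be parsed as UTF-8.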