How are you processing the text? How is it getting encoded to UTF-8? Is it really '&amp;reg;' or '&reg;' (with something else encoding the ampersand to &amp;)? If the former, you should be double decoding it. If the latter, then presumably your XML parser is resolving the entity into the Unicode character. Hopefully, it is marking the Perl string as Unicode.

UTF-8 is pretty easy to understand. It is a way to encode Unicode characters that can be processed by tools that handle normal C strings. All ASCII characters have the same encoding. Larger characters are encoded in two or more bytes. If you get a malformed char error, it could mean your string was corrupted. More likely is that it isn't a UTF-8 string, but some 8-bit encoding like Latin-1. The proper solution is to translate the other encoding into UTF-8 and let Perl handle it.
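As a rough sketch of that last step, the snippet below tries a strict UTF-8 decode first and falls back to Latin-1 when the bytes are malformed. The helper name to_characters and the choice of ISO-8859-1 as the fallback are assumptions for illustration; files from other sources may use a different 8-bit encoding.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: turn raw bytes into a Perl character string.
# Strict UTF-8 is tried first; on failure the bytes are ASSUMED to be
# ISO-8859-1 (Latin-1), which may not hold for every input file.
sub to_characters {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed UTF-8 instead of
    # silently substituting; LEAVE_SRC keeps $bytes unmodified.
    my $chars = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return $chars if defined $chars;
    return decode('ISO-8859-1', $bytes);
}

my $latin1 = "Acme\xAE";        # registered sign as one Latin-1 byte
my $utf8   = "Acme\xC2\xAE";    # registered sign as two UTF-8 bytes

for my $raw ($latin1, $utf8) {
    my $chars = to_characters($raw);
    printf "%d bytes in, %d characters out, last code point U+%04X\n",
        length($raw), length($chars), ord(substr($chars, -1));
}
```

Both inputs end up as the same five-character string ending in U+00AE, even though one started as five bytes and the other as six.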

Re^3: UTF-8 Malformed Char Error -- how to find and remove bad chars
by water (Deacon) on Jun 23, 2004 at 16:56 UTC
    How are you processing the text?

    XML::Twig

    How is it getting encoded to UTF-8 characters?

    I don't know, I received the file from elsewhere, and it is one of many, from many sources.

    Is it really '&amp;reg;' or '&reg;' (with something else encoding the ampersand to &amp;)?

    I think it is the ®, (R) symbol.

    If the former, you should be double decoding it.

    Please explain more what you mean here... thanks!

      How does it look in the file? I think Perl Monks screwed up your first post. '&amp;reg;' is double encoded: the ampersand of the entity reference was itself encoded to '&amp;'. It should be decoded to '&reg;'.
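To illustrate what double decoding means, here is a minimal sketch that applies the same entity-decoding pass twice. The tiny %entity table and the decode_once helper are made up for this example; a real file would need the full HTML entity set, or a module that provides it.

```perl
use strict;
use warnings;

# Hypothetical, deliberately tiny entity table (illustration only).
my %entity = (amp => '&', lt => '<', gt => '>', reg => "\xAE");

# One decoding pass: replace each known named entity with its character
# and leave unknown entities untouched.
sub decode_once {
    my ($text) = @_;
    $text =~ s/&(\w+);/exists $entity{$1} ? $entity{$1} : "&$1;"/ge;
    return $text;
}

my $double = 'Acme&amp;reg;';        # ampersand was encoded a second time
my $once   = decode_once($double);   # first pass restores the entity
my $twice  = decode_once($once);     # second pass yields the character

printf "%s -> %s -> last code point U+%04X\n",
    $double, $once, ord(substr($twice, -1));
```

The first pass only undoes the extra ampersand encoding; the second pass is the one that produces the registered sign itself.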

      If it is '&reg;', then I am guessing you are using a DTD with the HTML entities defined. The parser should map the entity '&reg;' to the character U+00AE; both are displayed as ®.

      The problem is if the parser is not converting the character to UTF-8. It seems to be marking the string as Unicode.

      What version of Perl are you using? Perl 5.6 has some bugs in Unicode handling. What version of XML::Twig and XML::Parser are you using? Try running the following code to see how your Perl handles the character.

      use Encode;

      # The registered sign as a single character (code point 0xAE)
      my $unicode = "\xAE";
      print length($unicode), "\n";
      print ord(substr($unicode, $_, 1)), "\n" for 0 .. length($unicode) - 1;
      print $unicode, "\n";

      # The same character encoded as UTF-8 bytes
      my $bytes = Encode::encode('utf8', $unicode);
      print length($bytes), "\n";
      print ord(substr($bytes, $_, 1)), "\n" for 0 .. length($bytes) - 1;
      print $bytes, "\n";