in reply to UTF-8 Malformed Char Error -- how to find and remove bad chars

Here's the offending thing (as displayed in a text editor)
The Widget&reg is the perfect...
What is this? I know &amp; is the ampersand, and &reg; is the (R) symbol -- is this semi-mangled markup, with the amp 'double encoded'? I'm trying the
my $octets = encode("utf8", $x, Encode::FB_DEFAULT); $x = decode("utf8", $octets, Encode::FB_DEFAULT);
suggestion and the character is still there. Should I be using use bytes; as well?

All are welcome to downvote this node and just tell me to "RTFM", but I've been trying to make sense of the docs, and golly, UTF is hard to understand. For me, at least. Mea culpa.

I just want to strip this stuff and make it go away from my files.

utf befuddled --

water

Re^2: UTF-8 Malformed Char Error -- how to find and remove bad chars
by iburrell (Chaplain) on Jun 23, 2004 at 16:19 UTC
    How are you processing the text? How is it getting encoded to UTF-8 characters? Is it really '&amp;reg;' or '&reg;' (and something else is encoding the ampersand to &amp;)? If the former, you should be double decoding it. If the latter, then presumably your XML parser is resolving the entity into the Unicode character. Hopefully, it is marking the Perl string as Unicode.

    UTF-8 is pretty easy to understand. It is a way to encode Unicode characters that can be processed by tools that handle normal C strings. All ASCII characters have the same encoding. Larger characters are encoded in two or more bytes. If you get a malformed char error, it could mean your string was corrupted. More likely is that it isn't a UTF-8 string, but some 8-bit encoding like Latin-1. The proper solution is to translate the other encoding into UTF-8 and let Perl handle it.
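
    To make the byte counts above concrete, here is a minimal sketch using the core Encode module. The helper names utf8_bytes and latin1_to_utf8 are made up for this illustration, and the second one assumes the incoming data really is Latin-1, which you would have to confirm for each source file.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: return the UTF-8 octets of a character string
# as a list of byte values.
sub utf8_bytes {
    my ($chars) = @_;
    return map { ord($_) } split //, encode('UTF-8', $chars);
}

# Hypothetical helper: treat the input as Latin-1 octets (an assumption
# about the data), turn them into characters, and re-encode as UTF-8.
sub latin1_to_utf8 {
    my ($octets) = @_;
    return encode('UTF-8', decode('ISO-8859-1', $octets));
}

# ASCII stays one byte; larger code points take two or more.
printf "A      -> %s\n", join ' ', map { sprintf '%02X', $_ } utf8_bytes('A');        # 41
printf "U+00AE -> %s\n", join ' ', map { sprintf '%02X', $_ } utf8_bytes("\x{AE}");   # C2 AE
printf "U+20AC -> %s\n", join ' ', map { sprintf '%02X', $_ } utf8_bytes("\x{20AC}"); # E2 82 AC
```

    A lone 0xAE byte, as found in a Latin-1 file, is not valid UTF-8 by itself, which is one common way to end up with a malformed character warning. Passing that byte through latin1_to_utf8 yields the two octets C2 AE.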

      How are you processing the text?

      XML::Twig

      How is it getting encoded to UTF-8 characters?

      I don't know, I received the file from elsewhere, and it is one of many, from many sources.

      Is it really '&amp;reg;' or '&reg;' (and something else is encoding the ampersand to &amp;)?

      I think it is the &reg;, (R) symbol.

      If the former, you should be double decoding it.

      Please explain more what you mean here... thanks!

        How does it look in the file? I think Perl Monks screwed up your first post. &amp;reg; is double encoded; the ampersand on the entity reference was encoded to &amp;. It should be decoded to '&reg;'.
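
        If the file really does contain the double-encoded form, one way to undo a single level of escaping before parsing is sketched below. The sub name undo_double_encoding is hypothetical, and the regex assumes that every "&amp;" immediately followed by an entity-like name and a semicolon was an entity reference to begin with; literal text that happens to look like that would be changed too.

```perl
use strict;
use warnings;

# Hypothetical helper: undo one level of escaping on entity references,
# e.g. "&amp;reg;" becomes "&reg;" and "&amp;#174;" becomes "&#174;".
# A bare "&amp;" that is not followed by name-plus-semicolon is left alone.
sub undo_double_encoding {
    my ($text) = @_;
    $text =~ s/&amp;(#?\w+;)/&$1/g;
    return $text;
}

print undo_double_encoding('The Widget&amp;reg; is the perfect...'), "\n";
# prints: The Widget&reg; is the perfect...
```

        After this step the parser still has to know what &reg; means, for example through a DTD that defines the HTML entities.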

        If it is &reg;, then I am guessing you are using a DTD with the HTML entities defined. The parser should map the entity &reg; to the character ® (U+00AE, which can also be written as &#174;); both forms are displayed as ®.

        The problem is if the parser is not converting the character to UTF-8. It seems to be marking the string as Unicode.

        What version of Perl are you using? Perl 5.6 has some bugs in Unicode handling. What version of XML::Twig and XML::Parser are you using? Try running the following code to see how your Perl handles the character.

        use Encode;

        # One character, U+00AE (the registered sign)
        my $unicode = "\xAE";
        print length($unicode), "\n";
        print ord(substr($unicode, $_, 1)), "\n" for 0 .. length($unicode) - 1;
        print $unicode, "\n";

        # The same character encoded as UTF-8 octets
        my $bytes = Encode::encode('utf8', $unicode);
        print length($bytes), "\n";
        print ord(substr($bytes, $_, 1)), "\n" for 0 .. length($bytes) - 1;
        print $bytes, "\n";