in reply to Re^2: UTF-8 Malformed Char Error -- how to find and remove bad chars
in thread UTF-8 Malformed Char Error -- how to find and remove bad chars

How are you processing the text?

XML::Twig

How is it getting encoded to UTF-8 characters?

I don't know, I received the file from elsewhere, and it is one of many, from many sources.

Is it really '&amp;reg;' or '&reg;' (and something else is encoding the ampersand to &amp;)?

I think it is the ®, (R) symbol.

If the former, you should be double decoding it.

Please explain more what you mean here... thanks!

Re^4: UTF-8 Malformed Char Error -- how to find and remove bad chars
by iburrell (Chaplain) on Jun 23, 2004 at 18:18 UTC
    How does it look in the file? I think Perl Monks screwed up your first post. &amp;reg; is double encoded: the ampersand of the entity reference was itself encoded to &amp;. It should be decoded to '&reg;'.
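To illustrate what double decoding means for a string like '&amp;reg;', here is a minimal sketch. It assumes only two named entities matter; the `%entity` table and the `decode_once` helper are hypothetical and written just for this example, not taken from XML::Twig or any other module. A real program would more likely rely on the XML parser or a dedicated entity module.

```perl
use strict;
use warnings;

# Hypothetical two-entry table, just enough for this example.
my %entity = (
    amp => '&',
    reg => "\xAE",    # U+00AE, the registered sign
);

# Hypothetical helper: one decoding pass over named entities.
# Entities that are not in the table are left untouched.
sub decode_once {
    my ($text) = @_;
    $text =~ s/&(\w+);/exists $entity{$1} ? $entity{$1} : "&$1;"/ge;
    return $text;
}

my $double = 'Acme&amp;reg;';          # ampersand was encoded a second time
my $single = decode_once($double);     # first pass undoes only the &amp;
my $plain  = decode_once($single);     # second pass yields the character

printf "%s -> %s -> %d characters, last code point %d\n",
    $double, $single, length($plain), ord(substr($plain, -1));
```

One pass only turns '&amp;reg;' back into '&reg;'; it takes a second pass to get the actual character, which is why the text has to be decoded twice if the file really contains the doubly encoded form.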

    If it is &reg;, then I am guessing you are using a DTD with the HTML entities defined. The parser should map the entity &reg; to the character U+00AE (&#174; as a numeric reference); both are displayed as ®.

    The problem is if the parser is not converting the character to UTF-8. It seems to be marking the string as Unicode.

    What version of Perl are you using? Perl 5.6 has some bugs in Unicode handling. What version of XML::Twig and XML::Parser are you using? Try running the following code to see how your Perl handles the character.

    use Encode;

    # U+00AE (registered sign) as a one-character string
    my $unicode = "\xAE";
    print length($unicode), "\n";
    print ord(substr($unicode, $_, 1)), "\n" for 0 .. length($unicode) - 1;
    print $unicode, "\n";

    # The same character encoded to UTF-8 bytes
    my $bytes = Encode::encode('utf8', $unicode);
    print length($bytes), "\n";
    print ord(substr($bytes, $_, 1)), "\n" for 0 .. length($bytes) - 1;
    print $bytes, "\n";
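As a complement to the test above, the following sketch shows one way to check whether a byte string is well-formed UTF-8. It assumes a Perl with the Encode module available (5.8 or later); the helper name `is_valid_utf8` is hypothetical. The point it demonstrates is that a lone 0xAE byte (the Latin-1 form of ®) is not valid UTF-8, while the two-byte sequence 0xC2 0xAE is.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if the byte string decodes cleanly as UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # decode() from modifying our copy of the bytes.
    my $ok = eval {
        Encode::decode('UTF-8', $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

my $latin1_reg = "\xAE";        # registered sign as a single Latin-1 byte
my $utf8_reg   = "\xC2\xAE";    # the same character encoded as UTF-8

print "0xAE alone: ", (is_valid_utf8($latin1_reg) ? "valid" : "malformed"), "\n";
print "0xC2 0xAE:  ", (is_valid_utf8($utf8_reg)   ? "valid" : "malformed"), "\n";
```

If the file contains the single-byte form but is declared or treated as UTF-8, that mismatch alone could plausibly produce a malformed character error.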