
How does it look in the file? I think Perl Monks screwed up your first post. '&amp;reg;' is double encoded: the ampersand of the entity reference was itself encoded to '&amp;'. It should be decoded to '&reg;'.

If it is '&reg;', then I am guessing you are using a DTD with the HTML entities defined. The parser should map the entity '&reg;' to '&#174;', both of which are displayed as ®.

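The problem arises if the parser is not converting the character to UTF-8 but is still marking the string as Unicode, which is what seems to be happening. A lone 0xAE byte is not a valid UTF-8 sequence, so a string flagged that way would trigger the malformed character error.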
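To check that guess, something like the following might help. It is only a sketch: the helper name is_wellformed_utf8 and the sample bytes are my own, not anything from XML::Twig or XML::Parser, and it assumes a Perl with the Encode module (5.8 or later). It reports whether Perl has flagged a string as characters and whether the raw bytes would survive a strict UTF-8 decode.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: returns 1 if the byte string is well-formed UTF-8,
# 0 otherwise. It decodes a copy because Encode may modify its input
# when a CHECK value such as FB_CROAK is passed.
sub is_wellformed_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;
    my $ok = eval { decode('UTF-8', $copy, FB_CROAK); 1 };
    return $ok ? 1 : 0;
}

# Assumed sample data: 0xAE on its own (Latin-1 for the registered sign)
# and the two-byte UTF-8 encoding of the same character.
my %samples = (
    'lone 0xAE'  => "\xAE",
    '0xC2 0xAE'  => "\xC2\xAE",
);

for my $name (sort keys %samples) {
    my $bytes = $samples{$name};
    print "$name: flagged as UTF-8: ",
        (utf8::is_utf8($bytes) ? "yes" : "no"),
        ", well-formed UTF-8: ",
        (is_wellformed_utf8($bytes) ? "yes" : "no"), "\n";
}
```

If the text coming out of your parser behaves like the lone 0xAE sample while also being flagged as UTF-8, that would fit the explanation above.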

What version of Perl are you using? Perl 5.6 has some bugs in Unicode handling. Which versions of XML::Twig and XML::Parser are you using? Try running the following code to see how your Perl handles the character.

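    use strict;
    use warnings;
    use Encode;

    # U+00AE (registered sign) as a one-character string.
    my $unicode = "\xAE";
    print length($unicode), "\n";    # number of characters
    print ord(substr($unicode, $_, 1)), "\n" for 0 .. length($unicode) - 1;
    print $unicode, "\n";

    # The same character encoded as UTF-8 bytes.
    my $bytes = Encode::encode('utf8', $unicode);
    print length($bytes), "\n";      # number of bytes
    print ord(substr($bytes, $_, 1)), "\n" for 0 .. length($bytes) - 1;
    print $bytes, "\n";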
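To answer the version questions, a short script along these lines could report them. This is a sketch, and module_version is a name I made up for illustration. It prints 'not installed' for any module that fails to load rather than dying.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the module's version string, 'unknown' if
# the module loads but declares no version, or undef if it cannot be loaded.
sub module_version {
    my ($module) = @_;
    my $loaded = eval "require $module; 1";
    return undef unless $loaded;
    my $version = $module->VERSION;
    return defined $version ? $version : 'unknown';
}

# $] holds the version number of the running interpreter.
print "Perl: $]\n";

for my $module (qw(XML::Twig XML::Parser Encode)) {
    my $version = module_version($module);
    print "$module: ", (defined $version ? $version : 'not installed'), "\n";
}
```

Posting that output along with the results of the test above should narrow things down.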