Re^4: UTF-8 Malformed Char Error -- how to find and remove bad chars

How does it look in the file? I think Perl Monks screwed up your first post. &amp;reg; is double encoded; the ampersand on the entity reference was encoded to &amp;. It should be decoded to '&reg;'.

If it is &reg;, then I am guessing you are using a DTD with the HTML entities defined. The parser should map the entity &reg; to the character reference &#xAE;, both of which are displayed as ®.

The problem is if the parser is not converting the character to UTF-8. It seems to be marking the string as Unicode.
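To illustrate what I mean by "marked as Unicode but not converted", here is a rough sketch. The helper name utf8_state is made up for this post, and Encode::_utf8_on is an internal function that I only call here to fake the broken state on purpose; I am assuming that state resembles what you are seeing, which I have not confirmed.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: report whether a string carries the UTF8 flag and
# whether its internal buffer is well-formed UTF-8.
sub utf8_state {
    my ($str) = @_;
    return {
        flagged => utf8::is_utf8($str) ? 1 : 0,
        valid   => utf8::valid($str)   ? 1 : 0,
    };
}

my $latin1 = "\xAE";    # the registered sign as one Latin-1 byte

# Proper conversion: decode the byte into a character string.
my $good = Encode::decode('iso-8859-1', $latin1);

# Broken on purpose: set the flag without converting the byte.
my $bad = $latin1;
Encode::_utf8_on($bad);

for my $case ([good => $good], [bad => $bad]) {
    my ($name, $str) = @$case;
    my $state = utf8_state($str);
    printf "%-5s flagged=%d valid=%d\n",
        $name, $state->{flagged}, $state->{valid};
}
```

A string that is flagged but not valid is the kind of thing that tends to produce "Malformed UTF-8 character" warnings as soon as Perl has to walk it.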

What version of Perl are you using? Perl 5.6 has some bugs in Unicode handling. What version of XML::Twig and XML::Parser are you using? Try running the following code to see how your Perl handles the character.

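use Encode;

# U+00AE (the registered sign) as a single character
my $unicode = "\xAE";
print length($unicode), "\n";               # number of characters
print ord(substr($unicode, $_, 1)), "\n"    # code point of each character
   for 0 .. length($unicode) - 1;

print $unicode, "\n";

# The same character encoded as UTF-8 bytes
my $bytes = Encode::encode('utf8', $unicode);
print length($bytes), "\n";                 # number of bytes
print ord(substr($bytes, $_, 1)), "\n"      # value of each byte
   for 0 .. length($bytes) - 1;

print $bytes, "\n";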
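
If it turns out that the file itself contains bytes that are not valid UTF-8, something along these lines could locate and drop them. This is only a sketch: strip_bad_utf8 is a name I made up, and it assumes you have read the data as raw bytes (no :utf8 layer on the filehandle). It relies on Encode::FB_QUIET, which makes decode return what it managed to process and leave the unprocessed remainder in the source variable.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode $octets as strict UTF-8, skipping every byte
# the decoder refuses. Returns the decoded string and a reference to the
# list of byte offsets that were dropped.
sub strip_bad_utf8 {
    my ($octets) = @_;
    my $clean    = '';
    my @bad_offsets;
    my $consumed = 0;

    while (length $octets) {
        my $before = length $octets;

        # FB_QUIET: return the decoded prefix; $octets keeps the remainder.
        $clean    .= Encode::decode('UTF-8', $octets, Encode::FB_QUIET);
        $consumed += $before - length $octets;

        last unless length $octets;      # everything decoded cleanly

        push @bad_offsets, $consumed;    # first byte the decoder refused
        substr($octets, 0, 1, '');       # drop it and carry on
        $consumed++;
    }

    return ($clean, \@bad_offsets);
}

# A stray Latin-1 byte (\xAE) followed later by a valid UTF-8 sequence (\xC2\xAE).
my $raw = "ab\xAEcd\xC2\xAEef";
my ($clean, $bad) = strip_bad_utf8($raw);
printf "bad byte at offset %d\n", $_ for @$bad;
```

Whether dropping the bytes is the right fix depends on where they came from; if they are really Latin-1, decoding them as iso-8859-1 is probably what you want instead.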