Re^3: UTF-8 Malformed Char Error -- how to find and remove bad chars

How are you processing the text?

XML::Twig

How is it getting encoded to UTF-8 characters?

I don't know. I received the file from elsewhere, and it is one of many, from many sources.

Is it really '&amp;reg;' or '&reg;' (and something else is encoding the ampersand to &amp;)?

I think it is the ®, the (R) symbol.

If the former, you should be double decoding it.

Please explain more what you mean here... thanks!

Re^4: UTF-8 Malformed Char Error -- how to find and remove bad chars by iburrell (Chaplain) on Jun 23, 2004 at 18:18 UTC
How does it look in the file? I think Perl Monks mangled your first post. `&amp;reg;` is double encoded: the ampersand of the entity reference was itself encoded to `&amp;`. It should be decoded to `&reg;`.

If it is `&reg;`, then I am guessing you are using a DTD with the HTML entities defined. The parser should map the entity `&reg;` to the character U+00AE; both are displayed as ®.

The problem arises if the parser is not converting the character to UTF-8. It seems to be marking the string as Unicode.

What version of Perl are you using? Perl 5.6 has some bugs in Unicode handling. What versions of XML::Twig and XML::Parser are you using? Try running the following code to see how your Perl handles the character.

```perl
use Encode;

# U+00AE (registered sign) as a single character
my $unicode = "\xAE";
print length($unicode), "\n";
print ord(substr($unicode, $_, 1)), "\n"
   for 0 .. length($unicode) - 1;

print $unicode, "\n";

# The same character encoded as UTF-8 bytes
my $bytes = Encode::encode('utf8', $unicode);
print length($bytes), "\n";
print ord(substr($bytes, $_, 1)), "\n"
   for 0 .. length($bytes) - 1;

print $bytes, "\n";
```
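To make the double-encoding point concrete, here is a minimal sketch using only the core Encode module. The helper names `decode_once` and `is_valid_utf8` and the sample string `Acme&amp;reg;` are made up for illustration; they are not from the original file or from XML::Twig. The sketch assumes the only entities involved are `&amp;` and `&reg;`; in a real program the XML parser would do the entity handling.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decodes only the two entities discussed in this
# thread. One call removes exactly one layer of encoding.
sub decode_once {
    my ($text) = @_;
    my %map = (amp => '&', reg => "\x{AE}");
    $text =~ s/&(amp|reg);/$map{$1}/g;
    return $text;
}

# Hypothetical helper: true if the byte string is well-formed UTF-8.
# FB_CROAK makes decode() die on malformed input; eval traps that.
sub is_valid_utf8 {
    my ($octets) = @_;
    my $ok = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

my $double = 'Acme&amp;reg;';         # double-encoded sample (made up)
my $once   = decode_once($double);    # one layer removed: 'Acme&reg;'
my $twice  = decode_once($once);      # 'Acme' followed by U+00AE

# U+00AE encoded as UTF-8 is the two bytes 0xC2 0xAE.
my $bytes = encode('UTF-8', $twice);
my $hex   = join ' ', map { sprintf '%02X', ord } split //, $bytes;

print "$once\n";                      # Acme&reg;
print "$hex\n";                       # 41 63 6D 65 C2 AE

# A lone 0xAE byte (Latin-1 for the registered sign) is not valid UTF-8,
# which is the kind of input that triggers a malformed-character error.
print is_valid_utf8("Acme\xAE") ? "valid\n" : "malformed\n";
print is_valid_utf8($bytes)     ? "valid\n" : "malformed\n";
```

If a file from an unknown source fails a check like `is_valid_utf8`, one likely cause is that it is really Latin-1 (or similar) and needs converting before the parser sees it, though that has to be confirmed per file.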
