unexpected continuation byte 0xae, with no preceding start byte

To make sense of that error message, you'd want to read the "UTF-8" portion of section 3.9 (pp. 77-78) of the Unicode Book (find it here at the link labeled "3. Conformance").
I think what it's telling you is that you actually have text data that is using the single-byte "ISO 8859-1" (Latin1) encoding, where the "(R)" symbol is expressed as "0xAE".
Now, if it were really supposed to be Unicode text data, the code point for that would be U+00AE ("0x00AE" as a 16-bit value). Owing to the way that UTF-8 is designed, any code point above 0x7F takes more than one byte, so this one would have to be expressed as the two-byte sequence "0xC2 0xAE". The error message is simply saying that the initial "0xC2" byte isn't there.
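To see this for yourself, here is a minimal sketch using the core Encode module. The variable names are just for illustration; it encodes U+00AE to UTF-8 to show the two bytes, then checks that a lone 0xAE byte fails a strict UTF-8 decode.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+00AE (REGISTERED SIGN) as a Perl character string
my $char = "\x{AE}";

# Encoding to UTF-8 yields the two-byte sequence C2 AE
my $utf8_bytes = encode("UTF-8", $char);
my $hex = join " ", map { sprintf "%02X", ord } split //, $utf8_bytes;
print "UTF-8 bytes: $hex\n";    # prints "UTF-8 bytes: C2 AE"

# A lone 0xAE byte (the Latin-1 form) is not valid UTF-8:
# with FB_CROAK, decode() dies on malformed input instead of
# silently substituting a replacement character
my $lone = "\xAE";
my $ok = eval { decode("UTF-8", $lone, Encode::FB_CROAK); 1 };
print "lone 0xAE is ", ($ok ? "valid" : "malformed"), " UTF-8\n";
```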
So the problem would seem to be that your script is assuming that it is getting utf8 data, when in fact this file contains an ISO 8859-1 single-byte character. To get it to scan properly as utf8, you need to "decode" it out of 8859-1:

    use Encode;
    ...
    my $utf8_version = decode("iso-8859-1", $orig_version);
    ...

Of course, if you just want to get rid of the nasty little booger, and make sure your data is nothing but ASCII:

    ... # assume that offending text is in $_
    use bytes;
    tr/\x01-\x7f//cd;  # delete any byte with hi-bit set (and NUL)
    ...
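If you can't be sure which encoding a given input uses, one possible approach (my own sketch, not something from the thread) is to try a strict UTF-8 decode first and fall back to ISO 8859-1 only when that fails. The helper name to_characters is hypothetical. Bear in mind that every byte sequence is valid Latin-1, so the fallback never fails and this is only a heuristic.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: try strict UTF-8 first, fall back to Latin-1.
sub to_characters {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed UTF-8;
    # LEAVE_SRC keeps $bytes intact so the fallback sees all of it
    my $text = eval {
        decode("UTF-8", $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return defined $text ? $text : decode("iso-8859-1", $bytes);
}

my $from_latin1 = to_characters("Acme\xAE");      # lone 0xAE: falls back
my $from_utf8   = to_characters("Acme\xC2\xAE");  # valid UTF-8 sequence
print "same text either way\n" if $from_latin1 eq $from_utf8;
```

Both calls return the same five-character string ending in U+00AE, which you can then encode however your output needs.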
In reply to Re: UTF-8 Malformed Char Error -- how to find and remove bad chars
by graff
in thread UTF-8 Malformed Char Error -- how to find and remove bad chars
by water