
To understand this part of the error message:

unexpected continuation byte 0xae, with no preceding start byte

you'd want to read the "UTF-8" portion of section 3.9 (pp. 77-78) of the Unicode Book (find it here at the link labeled "3. Conformance").

I think what it's telling you is that your text data is actually in the single-byte "ISO 8859-1" (Latin-1) encoding, where the "(R)" symbol (the registered sign) is expressed as the single byte "0xAE".

Now, if it were really supposed to be Unicode text data, the code point for that character would be U+00AE. Owing to the way that UTF-8 is designed, any code point above U+007F has to be expressed using at least two bytes, and for this one the two-byte sequence would be "0xC2 0xAE". A byte like 0xAE (bit pattern 10xxxxxx) is only valid as a continuation byte, so the error message is simply saying that the initial "0xC2" start byte isn't there.
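If you want to see that byte arithmetic for yourself, here is a minimal sketch using only the core Encode module. The variable names are just for illustration, and the FB_CROAK check is my choice so that a malformed sequence dies instead of being replaced quietly:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+00AE (REGISTERED SIGN) as a one-character Perl string
my $char = "\x{AE}";

# Encoding to UTF-8 yields two bytes: start byte 0xC2, continuation byte 0xAE
my $utf8_bytes = encode("UTF-8", $char);
my $hex = join " ", map { sprintf "0x%02X", ord($_) } split //, $utf8_bytes;
print "$hex\n";

# A lone 0xAE is not valid UTF-8; with FB_CROAK, decode dies
# instead of silently substituting U+FFFD
my $lone = "\xAE";
my $ok = eval { decode("UTF-8", $lone, Encode::FB_CROAK); 1 };
print $ok ? "lone 0xAE decoded\n" : "lone 0xAE rejected as malformed\n";
```

The first print should show "0xC2 0xAE", and the second should report that the lone byte was rejected, which is the same complaint your script is making.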

So the problem would seem to be that your script is assuming that it is getting utf8 data, when in fact this file contains an ISO 8859-1 single-byte character. To get it to scan properly as utf8, you need to "decode" it out of 8859-1:

use Encode;

...

my $utf8_version = decode("iso8859-1", $orig_version);

...
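To make the decode step a bit more concrete, here is a small sketch of the full round trip, assuming the input really is Latin-1. The sample string is hypothetical, and I've spelled the encoding name "iso-8859-1", which is the canonical form that Encode reports:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: raw Latin-1 bytes, as they would arrive from the file
my $orig_version = "Acme\xAE";

# Bytes -> Perl characters (every Latin-1 byte maps to exactly one character)
my $chars = decode("iso-8859-1", $orig_version);

# Characters -> UTF-8 bytes, ready to be written out
my $utf8_bytes = encode("UTF-8", $chars);

printf "%d Latin-1 bytes in, %d UTF-8 bytes out\n",
    length($orig_version), length($utf8_bytes);
```

The output grows by one byte because the single 0xAE becomes the pair 0xC2 0xAE. If you are reading from a filehandle, opening it with an ":encoding(iso-8859-1)" layer should do the same decoding for you on input.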

Of course, if you just want to get rid of the nasty little booger, and make sure your data is nothing but ASCII:

...

# assume that offending text is in $_

use bytes;
tr/\x01-\x7f//cd;  # delete NUL and any byte with hi-bit set (tr takes a plain list, no [] needed)

...
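If you would rather see where the bad bytes are before throwing them away, something like this sketch might help. It assumes the data is still a plain byte string (not yet decoded), the sample line is made up, and unlike the snippet above it keeps NUL bytes:

```perl
use strict;
use warnings;

# Hypothetical sample line containing one Latin-1 byte (0xAE)
my $line = "Acme\xAE Widgets";

# Report the offset and value of each byte outside 0x00-0x7F
while ($line =~ /([^\x00-\x7F])/g) {
    printf "offset %d: 0x%02X\n", pos($line) - 1, ord($1);
}

# Then remove them from a copy, keeping only ASCII
(my $clean = $line) =~ tr/\x00-\x7F//cd;
print "$clean\n";
```

For the sample line this reports the 0xAE at offset 4 and leaves "Acme Widgets" in $clean, with the original line untouched.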

In reply to Re: UTF-8 Malformed Char Error -- how to find and remove bad chars by graff
in thread UTF-8 Malformed Char Error -- how to find and remove bad chars by water
