In general, testing for UTF-8 well-formedness is not necessarily a good means of determining the real encoding of a file -- at least it's not perfect. And even though Encode::Guess does use somewhat more elaborate mechanisms, it's still just a guess (as the name implies; otherwise it would be called Encode::Determine :)

Especially with texts consisting mostly of plain ASCII, it can be rather difficult to disambiguate between encodings without looking at quite a lot of (possibly semantic) context... In particular, with CP1252 being a single-byte encoding, essentially any valid UTF-8 byte sequence is also some valid CP1252 text, though many of the resulting character combinations are unlikely to be found in real life.
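To make the ambiguity concrete, here is a minimal sketch in Python (my choice of language for illustration; the helper name decodes_as and the sample strings are made up for this example). It tries to decode the same bytes both ways: pure ASCII decodes identically under either encoding, and UTF-8 bytes for an accented word also decode "successfully" as CP1252, just to the wrong characters.

```python
def decodes_as(data, encoding):
    """Return True if the bytes decode without error in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

# Hypothetical samples: one pure-ASCII, one containing U+00E9 encoded as UTF-8
plain = b"plain ASCII text"
accented = "caf\u00e9".encode("utf-8")   # bytes 63 61 66 c3 a9

for sample in (plain, accented):
    # Both samples pass both checks, so well-formedness alone decides nothing
    print(sample, decodes_as(sample, "utf-8"), decodes_as(sample, "cp1252"))

# Read as CP1252, the two UTF-8 bytes c3 a9 turn into two separate characters
print(ascii(accented.decode("cp1252")))
```

Note that a successful decode only says the bytes are legal in that encoding, not that the encoding is the one the author used.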

However, there are still a number of such ambiguous sequences which are not too unlikely to occur in real-world texts written in real-world languages.

For example, the byte sequence c4a8 (hex) represents the two characters Ä" (capital A-umlaut, double-quote) when interpreted in the encoding CP1252 (or Latin1 for that matter). However, this byte sequence also happens to be the UTF-8 representation of the Unicode codepoint U+0128 (name: "LATIN CAPITAL LETTER I WITH TILDE", glyph: Ĩ ).

So, assuming you had some hypothetical text in CP1252, like

... the capital A umlaut "Ä" may cause problems ...

your detection heuristics would incorrectly flag it as UTF-8 (as it's perfectly well-formed), which would render the text's semantics into some nonsense like

... the capital A umlaut "Ĩ may cause problems ...

IOW, don't blindly trust mere guesses... Just a friendly word of caution.

Update: As pointed out by graff, it turns out the above example is incorrect (in CP1252, byte a8 is the diaeresis "¨", not the double quote, which is byte 22)... but I think the basic message is clear.
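For anyone who wants to check the c4a8 example themselves, the following Python sketch (an illustration only; the helper name is_wellformed_utf8 is made up here) decodes the two bytes both ways and then tests whether the hypothetical CP1252 text, with a real double quote after the umlaut, would pass as UTF-8 at all.

```python
# The two bytes from the example
ambiguous = bytes.fromhex("c4a8")

as_utf8 = ambiguous.decode("utf-8")      # read as one UTF-8 sequence
as_cp1252 = ambiguous.decode("cp1252")   # read as two single-byte characters

# Print code points rather than glyphs to avoid terminal encoding issues
print([f"U+{ord(ch):04X}" for ch in as_utf8])
print([f"U+{ord(ch):04X}" for ch in as_cp1252])

def is_wellformed_utf8(data):
    """Return True if the bytes decode as UTF-8 without error."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# The hypothetical text, encoded as CP1252; \u00c4 is the capital A umlaut,
# followed by an ordinary double quote (byte 22)
sample = '... the capital A umlaut "\u00c4" may cause problems ...'.encode("cp1252")
print(is_wellformed_utf8(sample))
```

Since c4 must be followed by a continuation byte in the range 80-bf, the sequence c4 22 in the sample is not well-formed UTF-8, so this particular text would not be misdetected.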

Instead of wasting my time on finding a better example, I'll leave it to the interested reader to decide for themselves whether any of the 65408 potentially critical byte combinations (leaving out the 4-byte sequences) might cause problems for them. The construction principle (i.e. which sequences parse as valid UTF-8, leaving aside any peculiarities for the moment) would be:

for 2-byte sequences: first byte from the range c2-df, second byte from the range 80-bf
for 3-byte sequences: first byte from the range e0-ef, second from the range 80-bf (a0-bf if the first byte is e0), and third from the range 80-bf

(Table of all CP1252 characters for example here)
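The count of 65408 can be reproduced by brute force. The Python sketch below is only an illustration (function and variable names are made up), and it rests on one assumption: the second byte of a 3-byte sequence ranges over 80-bf, except after e0, where it starts at a0 to exclude overlong forms. It also counts how many of those sequences Python's cp1252 codec accepts; that codec rejects the five bytes that CP1252 leaves unassigned (81, 8d, 8f, 90, 9d), so the second number comes out lower.

```python
def candidate_sequences():
    """Yield every 2- and 3-byte sequence matching the construction principle."""
    for first in range(0xC2, 0xE0):
        for second in range(0x80, 0xC0):
            yield bytes([first, second])
    for first in range(0xE0, 0xF0):
        # Assumption: after e0 the second byte starts at a0 (no overlong forms)
        second_start = 0xA0 if first == 0xE0 else 0x80
        for second in range(second_start, 0xC0):
            for third in range(0x80, 0xC0):
                yield bytes([first, second, third])

def decodes_as(data, encoding):
    """Return True if the bytes decode without error in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

total = 0
also_cp1252 = 0
for seq in candidate_sequences():
    total += 1
    if decodes_as(seq, "cp1252"):
        also_cp1252 += 1

print(total)         # number of candidate sequences
print(also_cp1252)   # how many of them Python's cp1252 codec accepts
```

The total includes the surrogate range (first byte ed, second byte a0-bf), which a strict UTF-8 decoder rejects; that is one of the peculiarities left aside above.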

In reply to Re: detect incorrect character encoding by almut
in thread detect incorrect character encoding by Errto
