in reply to detect incorrect character encoding
In general, testing for UTF-8 well-formedness is not necessarily a good means to determine the real encoding of a file -- at least it's not perfect. And, even though Encode::Guess does use somewhat more elaborate mechanisms, it's still just a guess (as the name implies, otherwise it would be called Encode::Determine :)
Especially with texts consisting mostly of plain ASCII, it can be rather difficult to disambiguate between encodings without looking at quite a lot of (possibly semantic) context... In particular, with CP1252 being a single-byte encoding, essentially any valid UTF-8 byte sequence is also valid CP1252 text (only the five byte values 0x81, 0x8D, 0x8F, 0x90 and 0x9D are unassigned in CP1252), though many of the resulting character combinations are unlikely to be found in real life.
However, there are still a number of such ambiguous sequences which are not too unlikely to occur in real world texts written in real world languages.
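To make the ambiguity concrete, here is a minimal sketch (in Python, and not what Encode::Guess does internally) that simply lists which candidate encodings accept a given byte string without a decoding error. The helper name plausible_encodings and the candidate list are my own assumptions for illustration.

```python
def plausible_encodings(data, candidates=("ascii", "utf-8", "cp1252")):
    """Return the candidate encodings under which `data` decodes without error.

    This is only a well-formedness test: it says which encodings are
    possible, not which one the author of the text actually used.
    """
    accepted = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # not well-formed in this encoding
        accepted.append(name)
    return accepted


# Pure ASCII is accepted by all three candidates, so it tells us nothing.
ascii_only = plausible_encodings(b"plain text")

# UTF-8 encoded text is also accepted as CP1252 (it would display as mojibake).
utf8_text = plausible_encodings("caf\u00e9".encode("utf-8"))

# CP1252 text with a lone accented character is usually rejected as UTF-8.
cp1252_text = plausible_encodings("caf\u00e9".encode("cp1252"))
```

Whenever more than one name comes back, the bytes alone cannot settle the question, and you need context or knowledge about where the file came from.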
For example, the byte sequence c4a8 (hex) represents the two characters Ä" (capital A-umlaut, double-quote) when interpreted in the encoding CP1252 (or Latin1 for that matter). However, this byte sequence also happens to be the UTF-8 representation of the Unicode codepoint U+0128 (name: "LATIN CAPITAL LETTER I WITH TILDE", glyph: Ĩ ).
So, assuming you had some hypothetical text in CP1252, like
... the capital A umlaut "Ä" may cause problems ...
your detection heuristics would incorrectly flag it as UTF-8 (as it's perfectly well-formed), which would render the text's semantics into some nonsense like
... the capital A umlaut "Ĩ may cause problems ...
IOW, don't blindly trust mere guesses... Just a friendly word of caution.
Update: As pointed out by graff, it turns out the above example is incorrect... but I think the basic message is clear.
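For anyone who wants to check an example like this byte by byte, here is a small Python sketch (Python's standard codecs only, nothing specific to Encode::Guess). It decodes c4a8 both ways and prints the character names. Note that CP1252 maps the byte a8 to the diaeresis (U+00A8), while the ASCII double quote is the byte 22, so the text Ä" is really the bytes c4 22.

```python
import unicodedata

raw = bytes.fromhex("c4a8")

as_cp1252 = raw.decode("cp1252")  # two characters
as_utf8 = raw.decode("utf-8")     # one character

for label, text in (("CP1252", as_cp1252), ("UTF-8", as_utf8)):
    names = ["U+%04X %s" % (ord(ch), unicodedata.name(ch)) for ch in text]
    print(label, names)

# The CP1252 text consisting of A-umlaut plus an ASCII double quote:
quoted = '\u00c4"'.encode("cp1252")

# Check whether those bytes would pass a UTF-8 well-formedness test.
try:
    quoted.decode("utf-8")
    quoted_is_utf8 = True
except UnicodeDecodeError:
    quoted_is_utf8 = False

print(quoted.hex(), "well-formed UTF-8:", quoted_is_utf8)
```

The lead byte c4 announces a two-byte sequence and must be followed by a continuation byte in the range 80..bf, which 22 is not.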
Instead of wasting my time on finding a better example, I'll leave it to the interested reader to decide for themselves whether any of the 65408 potentially critical character combinations (leaving out the 4-byte sequences) might cause problems for them. The construction principle would be (i.e. those parse as valid UTF-8, leaving aside any peculiarities for the moment):
(Table of all CP1252 characters for example here)
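As a rough sketch of where the figure 65408 comes from (Python again; the helper names utf8_encode and decodes_as_cp1252 are made up for this illustration): the 2-byte form 110xxxxx 10xxxxxx covers U+0080..U+07FF and the 3-byte form 1110xxxx 10xxxxxx 10xxxxxx covers U+0800..U+FFFF. The code builds every such sequence by hand and then counts how many of them also decode as CP1252 under Python's codec, which rejects the five unassigned bytes. Surrogates are deliberately left in, as one of the "peculiarities" set aside above.

```python
def utf8_encode(cp):
    """Hand-encode a code point in U+0080..U+FFFF using the UTF-8 bit patterns."""
    if 0x80 <= cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if 0x800 <= cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    raise ValueError("only 2- and 3-byte sequences are covered here")


def decodes_as_cp1252(data):
    """True if Python's cp1252 codec accepts every byte of `data`."""
    try:
        data.decode("cp1252")
        return True
    except UnicodeDecodeError:
        return False


sequences = [utf8_encode(cp) for cp in range(0x80, 0x10000)]
two_byte = sum(1 for s in sequences if len(s) == 2)
three_byte = sum(1 for s in sequences if len(s) == 3)
also_cp1252 = sum(1 for s in sequences if decodes_as_cp1252(s))

print("2-byte:", two_byte, "3-byte:", three_byte, "total:", len(sequences))
print("of which also decodable as CP1252:", also_cp1252)
```

How many of those combinations are plausible in real text is a separate question that no byte-level count can answer.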
Re^2: detect incorrect character encoding
by graff (Chancellor) on Jan 03, 2007 at 07:52 UTC
by almut (Canon) on Jan 03, 2007 at 16:05 UTC