in reply to Re: detect incorrect character encoding
in thread detect incorrect character encoding
For example, the OP seems to know that the possible encodings are bound to be either utf8 or cp1252. If it's also known that all the data are, say, in English, then the predominant evidence for cp1252 data will be the various "smart quotes" and other specialized punctuation marks that sit in the range 0x80 to 0x9f; these are virtually guaranteed to cause utf8 parsing errors.
Given the OP's premise, if a file fails to parse as utf8, it's either corrupted or else cp1252, and some simple statistics on actual vs. expected byte value frequencies can generally resolve between those two possibilities.
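To make the "parse first, then count bytes" idea concrete, here is a minimal sketch (in Python rather than Perl, purely for illustration). The function name `classify`, the set of "expected" punctuation bytes, and the 0.5 threshold are all assumptions of mine, not anything the OP specified; real data would call for tuning them.

```python
# Bytes assumed (for illustration) to be typical of English cp1252 text:
# ellipsis, curly single/double quotes, bullet, en dash, em dash.
EXPECTED_CP1252_PUNCT = frozenset(b"\x85\x91\x92\x93\x94\x95\x96\x97")


def classify(data: bytes, threshold: float = 0.5) -> str:
    """Guess whether data is utf8, cp1252, or possibly corrupted.

    Premise taken from the thread: the data is either utf8 or cp1252.
    The threshold is a hypothetical value, not a recommendation.
    """
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass

    # A utf8 failure implies at least one byte >= 0x80, so `high` is
    # never empty here (pure ASCII always decodes as utf8).
    high = [b for b in data if b >= 0x80]
    expected = sum(1 for b in high if b in EXPECTED_CP1252_PUNCT)

    # If most high-bit bytes are the punctuation we expect from English
    # cp1252 text, call it cp1252; otherwise flag it as suspect.
    if expected / len(high) >= threshold:
        return "cp1252"
    return "corrupt?"
```

For instance, `classify(b"He said \x93hi\x94 to me")` falls through the utf8 check and lands on "cp1252", whereas a run of bytes like `b"\xff\xfe\x80\x81"` gets flagged as suspect.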
As you point out, if a string can be parsed as utf8, there's an outside chance that it could be some other encoding, and all the high-bit-set bytes just happen to occur in groups that are parsable as utf8 wide characters. Honestly, the odds of this actually happening in any sort of natural language data are slim to the point of falling between negligible and impossible, and texts that are truly ambiguous in this regard only occur when they are deliberately constructed to be ambiguous.
It turns out that the example you constructed was incorrect: 0xA8 is the diaeresis mark in cp1252; the right-double-quote is 0x94. It's true that the byte sequence "\xC4\x94" can be parsed as the utf8 character "LATIN CAPITAL LETTER E WITH BREVE" (U+0114 -- Ĕ -- quite a rare beast not displayable on PM's Latin1-based pages).
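The two readings of those bytes are easy to check. A small sketch (again Python for illustration; the variable names are mine) that decodes the same two bytes both ways and looks up the Unicode names:

```python
import unicodedata

raw = b"\xC4\x94"

# Read as utf8: one two-byte character.
as_utf8 = raw.decode("utf-8")
utf8_names = [unicodedata.name(ch) for ch in as_utf8]

# Read as cp1252: two single-byte characters.
as_cp1252 = raw.decode("cp1252")
cp1252_names = [unicodedata.name(ch) for ch in as_cp1252]

# The byte 0xA8 on its own, read as cp1252.
a8_name = unicodedata.name(b"\xA8".decode("cp1252"))

print(utf8_names)
print(cp1252_names)
print(a8_name)
```

Under utf8 the pair is the single character U+0114; under cp1252 it is a capital A with diaeresis followed by the right double quotation mark, and 0xA8 by itself is the bare diaeresis.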
In any case, if such a text were in fact cp1252, then the use of 0x94 as a right (close) quote would tend to correlate with the use of 0x93 as the left (open) quote, and the 0x93 would surely cause a utf8 parse error: an open quote is normally preceded by a space or is string-initial, whereas in utf8 a byte in the range 0x80 to 0xBF is only valid as a continuation byte following a lead byte (see the "Unicode Encodings" section of perlunicode for details on why this would violate utf8 encoding).
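A quick way to see the open-quote problem, sketched in Python for illustration (the helper name `first_utf8_error` is hypothetical): try a strict utf8 decode and report the offset of the first offending byte.

```python
from typing import Optional


def first_utf8_error(data: bytes) -> Optional[int]:
    """Return the byte offset of the first utf8 violation, or None if valid."""
    try:
        data.decode("utf-8")
        return None
    except UnicodeDecodeError as err:
        return err.start


# 0x94 after 0xC4 is a legal continuation byte, so this parses as utf8.
ambiguous = b"\xC4\x94"

# 0x93 after a space has no lead byte in front of it, so utf8 parsing fails
# at the open quote itself.
quoted = b"say \x93hello\x94"

# The same holds when the open quote is string-initial.
initial = b"\x93hello\x94"

print(first_utf8_error(ambiguous))
print(first_utf8_error(quoted))
print(first_utf8_error(initial))
```

So a cp1252 text would have to use close quotes without ever using open quotes after whitespace for the ambiguity to survive, which is not how quotes turn up in real prose.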
Re^3: detect incorrect character encoding
by almut (Canon) on Jan 03, 2007 at 16:05 UTC