in reply to Re: detect incorrect character encoding
in thread detect incorrect character encoding

These are valid concerns, but they don't entail "looking at ... semantic context". Just a little valid a priori knowledge about the data can suffice to turn "guesses" into correct decisions.

For example, the OP seems to know that the possible encodings are bound to be either utf8 or cp1252. If it's also known that all the data are, say, in English, then the predominant evidence for cp1252 data will be the various "smart quotes" and other specialized punctuation marks that sit in the range 0x80-0x9F; these are virtually guaranteed to cause utf8 parsing errors.
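
For the curious, here's one way to see which characters actually sit in that range -- just an illustrative sketch using the core Encode and charnames modules:

    use Encode;
    use charnames ();
    # decode each cp1252 byte in 0x80-0x9f and show its Unicode code
    # point and character name (skipping bytes without a named mapping)
    for my $byte (0x80 .. 0x9f) {
        my $char = Encode::decode('cp1252', chr($byte));
        my $name = charnames::viacode(ord $char);
        printf "0x%02X => U+%04X %s\n", $byte, ord($char), $name
            if defined $name;
    }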

Given the OP's premise, if a file fails to parse as utf8, it's either corrupted or else cp1252, and some simple statistics on actual vs. expected byte value frequencies can generally distinguish between those two possibilities.
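
In code, that decision procedure might look roughly like this -- a minimal sketch using the Encode module, with a hypothetical guess_decode() helper and without the byte-frequency check:

    use Encode qw(decode FB_CROAK);

    # try strict utf8 first; fall back to cp1252 if that fails
    sub guess_decode {
        my ($octets) = @_;
        my $text = eval { decode('UTF-8', $octets, FB_CROAK) };
        return ($text, 'utf8') if defined $text;
        # not well-formed utf8 -- per the OP's premise, assume cp1252
        # (the actual-vs-expected frequency check on the high bytes
        # would go here, to catch corrupted files)
        return (decode('cp1252', $octets), 'cp1252');
    }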

As you point out, if a string can be parsed as utf8, there's an outside chance that it could be some other encoding, with all the high-bit-set bytes just happening to occur in groups that are parsable as utf8 wide characters. Honestly, the odds of this actually happening in any sort of natural-language data are slim to the point of falling somewhere between negligible and impossible, and texts that are truly ambiguous in this regard occur only when they are deliberately constructed to be ambiguous.

It turns out that the example you constructed was incorrect: 0xA8 is the diaeresis mark in cp1252; the right double quote is 0x94. It's true that the byte sequence "\xC4\x94" can be parsed as the utf8 character "LATIN CAPITAL LETTER E WITH BREVE" (U+0114 -- Ĕ -- quite a rare beast not displayable on PM's Latin1-based pages).
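
For what it's worth, the two readings are easy to compare with Encode:

    use Encode qw(decode);

    my $bytes = "\xC4\x94";
    # as utf8: one character, U+0114 (LATIN CAPITAL LETTER E WITH BREVE)
    printf "utf8:   U+%04X\n", ord decode('UTF-8', $bytes);
    # as cp1252: two characters, U+00C4 (A WITH DIAERESIS) and
    # U+201D (RIGHT DOUBLE QUOTATION MARK)
    printf "cp1252: %s\n", join " ",
        map { sprintf "U+%04X", ord } split //, decode('cp1252', $bytes);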

In any case, if such a text were in fact cp1252, then the use of 0x94 as a right (close) quote would tend to correlate with the use of 0x93 as the left (open) quote, and that would surely cause a utf8 parse error: the opening 0x93 will normally be preceded by a space or be string-initial, and in utf8 a lone byte in the 0x80-0xBF range following an ASCII character is an orphaned continuation byte (see the "Unicode Encodings" section of perlunicode for details on why this violates utf8 encoding).
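
A quick way to demonstrate that failure mode (just a sketch; the sample string is made up):

    use Encode qw(decode FB_CROAK);

    # typical cp1252 quoting: ASCII space, open quote 0x93, close quote 0x94;
    # the 0x93 is a bare utf8 continuation byte, so strict decoding dies
    my $bytes = "he said \x93hi\x94";
    my $ok = eval { decode('UTF-8', $bytes, FB_CROAK); 1 };
    print $ok ? "parses as utf8\n" : "utf8 parse error: $@";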


Re^3: detect incorrect character encoding
by almut (Canon) on Jan 03, 2007 at 16:05 UTC

    Well spotted, graff, the \xA8 is in fact the diaeresis mark. My bad.

    Anyway, I was just trying to point out that, in general, the fact that a text parses correctly as UTF-8 does not imply that it was originally created as such. Sure, if you can make a priori assumptions about the content, this might not be a problem in the specific case.

    In my attempt to come up with an example, I quickly listed relevant sequences with

    use Encode;
    # for each code point, print its raw utf8 octets (in hex) and how the
    # terminal renders those octets as single-byte characters
    for my $codepoint (0x80 .. 0xffff) {
        my $utf8 = pack "U", $codepoint;   # character with the utf8 flag on
        Encode::_utf8_off($utf8);          # treat it as raw octets
        printf "U+%04x %s '%s'\n", $codepoint, unpack("H*", $utf8), $utf8;
    }

    In my cursory scan of the output, I obviously picked a suboptimal example, because (as rendered by my terminal font) the diaeresis looked just like the double-quote to my eyes (which are still somewhat swollen at 6 a.m. -- Almut reminds herself not to post to public forums at this time of day, or at least to apply some basic sanity checks in advance ;). Looking at it now, it seems the glyph is about one pixel shorter vertically... Oh well.