Re^3: detect incorrect character encoding

Well spotted, graff: the \xA8 is in fact the diaeresis mark. My bad.

Anyway, I was just trying to point out that, in general, the fact that some text parses correctly as UTF-8 does not imply that it was originally created as UTF-8. Sure, if you can make a priori assumptions about the content, this might not be a problem in the specific case.
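To make that concrete, here is a minimal sketch (not the example from my earlier post; the byte pair is just one I picked for illustration). The two bytes C3 A9 are perfectly legal Latin-1 for "Ã©", yet they also pass a strict UTF-8 check and then decode to a single "é". The helper name codepoints() is made up for this sketch.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: list the code points of a decoded string.
sub codepoints { join ' ', map { sprintf 'U+%04X', ord } split //, shift }

# Two bytes that someone could have typed as Latin-1 "Ã©".
my $bytes = "\xC3\xA9";

# Strict decoding: FB_CROAK makes decode() die on malformed UTF-8.
# decode() may modify its input when a CHECK is given, so work on a copy.
my $copy      = $bytes;
my $as_utf8   = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
my $as_latin1 = decode('ISO-8859-1', $bytes);

printf "valid UTF-8: %s\n", defined $as_utf8 ? 'yes' : 'no';
printf "as UTF-8:    %s\n", codepoints($as_utf8)   if defined $as_utf8;
printf "as Latin-1:  %s\n", codepoints($as_latin1);
```

Both readings succeed, so the validity check alone cannot tell you which one the author meant.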

In my attempt to come up with an example, I quickly listed relevant sequences with:

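use strict;
use warnings;
use Encode qw(encode_utf8);

for my $codepoint (0x80..0xffff) {
    next if $codepoint >= 0xd800 && $codepoint <= 0xdfff;  # skip surrogates
    my $utf8 = encode_utf8(chr $codepoint);  # raw UTF-8 bytes of the character
    printf "U+%04x %s '%s'\n", $codepoint, unpack("H*", $utf8), $utf8;
}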

In my cursory scan of the output, I obviously picked a suboptimal example, because (as rendered by my terminal font) the diaeresis looked just like the double quote to my eyes (which are still somewhat swollen at 6 a.m. -- Almut reminds herself not to post to public forums at this time of day, or at least to apply some basic sanity checks in advance ;). Looking at it now, it seems the glyph is about one pixel shorter vertically... Oh well.
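One such sanity check, sketched here under the assumption that the byte is meant as Latin-1 (where byte values equal code points), is to ask Perl for the character name instead of trusting the glyph. The helper name name_of() is made up for this sketch; charnames::viacode() is the core function doing the work.

```perl
use strict;
use warnings;
use charnames ();

# Hypothetical helper: official Unicode name for a code point.
sub name_of { charnames::viacode(shift) }

# 0x22 is the ASCII double quote, 0xA8 the look-alike from Latin-1.
for my $byte (0x22, 0xA8) {
    printf "0x%02X  U+%04X  %s\n", $byte, $byte, name_of($byte);
}
```

This prints QUOTATION MARK for 0x22 and DIAERESIS for 0xA8, whatever the terminal font makes of them.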
