Errto has asked for the wisdom of the Perl Monks concerning the following question:

Folks,

I have a large collection of text files which all should be in UTF-8, but some of them are not. I need to determine which ones those are. I am currently trying something like this:

eval {
    open my $file, '<:utf8', $filename or die $!;
    local $/;
    <$file>;
};
die "$filename is invalid utf8: $@\n" if $@;
If I pass $filename as a file that is not in UTF-8 (i.e. it's in cp1252, which is my other contender), I get a warning on STDERR, but the eval does not die as I would like it to. How can I make it do that?

Re: detect incorrect character encoding
by shigetsu (Hermit) on Jan 03, 2007 at 00:05 UTC
    You could use a signal handler:
    {
        local $SIG{__WARN__} = sub { die "$filename is invalid utf8!\n" };
        eval { ... };
    }
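
    Spelled out in full, a minimal sketch that simply drops the code from the original question into that block ($filename is assumed to hold the name of the file being checked):

    use strict;
    use warnings;

    my $filename = 'some_file.txt';   # placeholder
    {
        # The :utf8 layer only warns about malformed data, so promote the
        # warning to a fatal error and let the eval catch it.
        local $SIG{__WARN__} = sub { die @_ };
        eval {
            open my $file, '<:utf8', $filename or die $!;
            local $/;
            <$file>;
        };
        die "$filename is invalid utf8: $@\n" if $@;
    }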
Re: detect incorrect character encoding
by graff (Chancellor) on Jan 03, 2007 at 01:49 UTC
    The first reply should work fine, unless your script generates other warnings unrelated to character encoding. If you are simply testing for utf8 vs. some other (single-byte) encoding, Encode::Guess should do nicely as well. Here's another way, which specifically tests that a given file is correctly encoded as utf8 data (with no errors, corruptions, or use of non-utf8 characters):
    use Encode;
    my $filename = "whatever";
    eval {
        open my $file, "<:raw", $filename or die $!;
        local $/;
        local $_ = <$file>;
        decode( "utf8", $_, Encode::FB_CROAK );
    };
    die "$filename is invalid utf8: $@\n" if $@;
    For more info on that, check the Encode man page, esp. the section titled "Handling Malformed Data".
Re: detect incorrect character encoding
by bsdz (Friar) on Jan 03, 2007 at 00:26 UTC
    I did something similar using Encode::Guess recently. It is bundled with the latest version of Perl.
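
    For reference, a usage sketch along those lines ($filename is a placeholder). utf8 is already among the default suspects, so only cp1252 needs to be added; and since every byte string is technically valid cp1252, be prepared for the guess to come back ambiguous, in which case an error string rather than an encoding object is returned.

    use strict;
    use warnings;
    use Encode::Guess;

    my $filename = 'some_file.txt';   # placeholder
    open my $fh, '<:raw', $filename or die "$filename: $!";
    my $data = do { local $/; <$fh> };

    my $enc = guess_encoding($data, 'cp1252');
    if (ref $enc) {
        printf "%s looks like %s\n", $filename, $enc->name;
    } else {
        warn "$filename: can't guess ($enc)\n";   # $enc holds an error message
    }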
Re: detect incorrect character encoding
by almut (Canon) on Jan 03, 2007 at 05:24 UTC

    In general, testing for UTF-8 well-formedness is not necessarily a good means of determining the real encoding of a file -- at least it's not perfect. And even though Encode::Guess does use somewhat more elaborate mechanisms, it's still just a guess (as the name implies -- otherwise it would be called Encode::Determine :)

    Especially with texts consisting mostly of plain ASCII, it can be rather difficult to disambiguate between encodings without looking at quite a lot of (possibly semantic) context... In particular, since CP1252 is a single-byte encoding, essentially any valid UTF-8 byte sequence is also some valid CP1252 text, though many such character combinations could be expected not to show up in real life.

    However, there are still a number of such ambiguous sequences which are not too unlikely to occur in real-world texts written in real-world languages.

    For example, the byte sequence c4a8 (hex) represents the two characters Ä" (capital A-umlaut, double-quote) when interpreted in the encoding CP1252 (or Latin1 for that matter). However, this byte sequence also happens to be the UTF-8 representation of the Unicode codepoint U+0128 (name: "LATIN CAPITAL LETTER I WITH TILDE", glyph: Ĩ ).

    So, assuming you had some hypothetical text in CP1252, like

    ... the capital A umlaut "Ä" may cause problems ...

    your detection heuristic would incorrectly flag it as UTF-8 (as it's perfectly well-formed), which would turn the text into nonsense like

    ... the capital A umlaut "Ĩ may cause problems ...

    IOW, don't blindly trust mere guesses... Just a friendly word of caution.

    Update: As pointed out by graff, it turns out the above example is incorrect... but I think the basic message is clear.

    Instead of wasting my time on finding a better example, I'll leave it to the interested reader to decide for themselves whether any of the 65408 potentially critical character combinations (leaving out the 4-byte sequences) might cause problems for them. The construction principle is as follows -- i.e. these are the sequences that parse as valid UTF-8, leaving aside any peculiarities for the moment (a short sketch of the 2-byte case follows below the list):

    • for 2-byte sequences:  first character from the range c2-df, second character from the range 80-bf
    • for 3-byte sequences:  first character from the range e0-ef, second from the range a0-bf, and third from the range 80-bf

    (Table of all CP1252 characters for example here)
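
    For the curious, a minimal sketch of the 2-byte case (the 3-byte case works analogously): it simply walks the two ranges listed above and prints how each byte pair reads as CP1252 next to how it reads as UTF-8.

    use strict;
    use warnings;
    use Encode qw(decode);

    binmode STDOUT, ':utf8';

    # Lead byte C2-DF, continuation byte 80-BF: every such pair is
    # well-formed UTF-8 and at the same time a pair of CP1252 characters
    # (a few continuation bytes are unassigned in CP1252 and come out as
    # whatever Encode substitutes for them; the C2 xx pairs decode to C1
    # control characters, so that column may look odd on a terminal).
    for my $lead (0xC2 .. 0xDF) {
        for my $cont (0x80 .. 0xBF) {
            my $bytes   = pack 'C2', $lead, $cont;
            my $as_1252 = decode('cp1252', $bytes);
            my $as_utf8 = decode('utf8', $bytes, Encode::FB_CROAK);
            printf "%02X%02X  cp1252: '%s'   utf8: U+%04X '%s'\n",
                $lead, $cont, $as_1252, ord($as_utf8), $as_utf8;
        }
    }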

      These are valid concerns, but they don't entail "looking at ... semantic context". Just a little a priori knowledge about the data can suffice to make "guesses" qualify as correct decisions.

      For example, the OP seems to know that the possible encodings are bound to be either utf8 or cp1252. If it's also known that all the data are, say, in English, then the predominant evidence for cp1252 data will be the various "smart quotes" and other specialized punctuation marks that sit in the range 0x80-0x9f; these are virtually guaranteed to cause utf8 parsing errors.

      Given the OP's premise, if a file fails to parse as utf8, it's either corrupted or else cp1252, and some simple statistics on actual vs. expected byte value frequencies can generally resolve between those two possibilities.
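
      One simple-minded version of such a statistic, as a rough sketch ($filename is a placeholder, and how to interpret the ratio is left to the reader): in an English cp1252 text, a good share of the high-bit bytes will typically be the smart quotes and dashes in 0x80-0x9F, whereas corruption has no particular reason to cluster there.

      use strict;
      use warnings;

      my $filename = 'suspect.txt';   # a file that failed the utf8 check
      open my $fh, '<:raw', $filename or die "$filename: $!";
      my $data = do { local $/; <$fh> };

      # Count high-bit bytes and how many of them fall into the cp1252
      # "smart punctuation" range 0x80-0x9F.
      my ($high, $punct) = (0, 0);
      for my $byte (unpack 'C*', $data) {
          next if $byte < 0x80;
          $high++;
          $punct++ if $byte <= 0x9F;
      }
      printf "%s: %d high-bit bytes, %.1f%% in 0x80-0x9F\n",
          $filename, $high, $high ? 100 * $punct / $high : 0;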

      As you point out, if a string can be parsed as utf8, there's an outside chance that it could be some other encoding, with all the high-bit-set bytes just happening to occur in groups that are parsable as utf8 wide characters. Honestly, the odds of this actually happening in any sort of natural language data are slim to the point of falling between negligible and impossible, and texts that are truly ambiguous in this regard only occur when they have been deliberately constructed to be ambiguous.

      It turns out that the example you constructed was incorrect: 0xA8 is the diaeresis mark in cp1252; the right double quote is 0x94. It's true that the byte sequence "\xC4\x94" can be parsed as the utf8 character "LATIN CAPITAL LETTER E WITH BREVE" (U+0114 -- Ĕ -- quite a rare beast, not displayable on PM's Latin1-based pages).

      In any case, if such a text were in fact cp1252, then the use of 0x94 as a right (close) quote would tend to correlate with the use of 0x93 as the left (open) quote, and that 0x93 would surely cause a utf8 parse error, because it will normally be preceded by a space or be string-initial (see the "Unicode Encodings" section of perlunicode for details on why this violates utf8 encoding).
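
      To see this concretely, here's a throwaway sketch with a made-up snippet of "smart-quoted" cp1252 text: the 0x93 after the space is a stray UTF-8 continuation byte, so FB_CROAK rejects the string immediately.

      use strict;
      use warnings;
      use Encode qw(decode);

      # cp1252 open/close quotes around a word, preceded by a space
      my $bytes = "he said \x93hello\x94 to me";
      eval { decode('utf8', $bytes, Encode::FB_CROAK) };
      print $@ ? "not well-formed utf8: $@" : "parsed as utf8\n";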

        Well spotted, graff -- the \xA8 is in fact the diaeresis mark. My bad.

        Anyway, I was just trying to point out that, in general, the fact that a text parses correctly as UTF-8 does not mean it was originally created as such. Sure, if you can make a priori assumptions about the content, this might not be a problem in the specific case.

        In my attempt to come up with an example, I quickly listed relevant sequences with

        use Encode;
        for my $codepoint (0x80 .. 0xffff) {
            my $utf8 = pack "U", $codepoint;   # character with that codepoint, stored as UTF-8
            Encode::_utf8_off($utf8);          # drop the flag so we see the raw byte sequence
            printf "U+%04x %s '%s'\n", $codepoint, unpack("H*", $utf8), $utf8;
        }

        In my cursory scan of the output, I obviously picked a suboptimal example, because (as rendered by my terminal font) the diaeresis looked just like the double quote to my eyes (which are still somewhat swollen at 6 a.m. -- Almut reminds herself not to post to public forums at this time of day, or at least to apply some basic sanity checks beforehand ;). Looking at it now, it seems the glyph is about one pixel shorter vertically... Oh well.

Re: detect incorrect character encoding
by cub.uanic (Acolyte) on Jan 04, 2007 at 05:42 UTC