in reply to What's the best way to detect character encodings, Windows-1252 v. UTF-8?

There are byte sequences that are typical of UTF-8. The first byte of a multi-byte UTF-8 character must be in the range 0xC0-0xF7 (0xC0-0xDF for 2-byte sequences, 0xE0-0xEF for 3-byte sequences, and 0xF0-0xF7 for 4-byte sequences), and all subsequent bytes must be in the range 0x80-0xBF. (Strictly speaking, valid UTF-8 never uses the lead bytes 0xC0, 0xC1 or 0xF5-0xF7, but the looser ranges are fine for a rough test.) So if you see an accented character that is not part of such a sequence, you simply know it's not UTF-8. You might guess it's probably ISO Latin-1 (= ISO-8859-1) or Microsoft's extension of it, the Windows character set AKA CP-1252; but that's not necessarily the case. It could be DOS text, for example... or ISO-8859-15.
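To make that concrete, here's a minimal sketch of such a check (the name looks_like_utf8 is mine, and it uses the loose lead-byte ranges above, so it won't reject overlong or out-of-range sequences):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Rough check: does every byte in the string belong either to plain
# ASCII or to a well-formed multi-byte UTF-8 sequence?
sub looks_like_utf8 {
    my ($bytes) = @_;
    return $bytes =~ /\A (?: [\x00-\x7F]                # ASCII
                           | [\xC0-\xDF][\x80-\xBF]     # 2-byte sequence
                           | [\xE0-\xEF][\x80-\xBF]{2}  # 3-byte sequence
                           | [\xF0-\xF7][\x80-\xBF]{3}  # 4-byte sequence
                         )* \z/x;
}
```

For example, "caf\xC3\xA9" (UTF-8 "café") passes, while "caf\xE9" (the same word in CP-1252 or Latin-1) fails, because the lone \xE9 is a lead byte with no continuation bytes after it.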

You could use heuristic/statistical methods and simply base a guess on the frequency of occurrence of the bytes (the repertoire): in a French text, for example, you'll find lots of "é", "è", "ê", "à" and "ç", but something like "þ" will be extremely rare.
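A toy version of that idea: of the high bytes in the text, what share are accented letters you'd actually expect in Western European text? (The function name and the "common letters" set below are illustrative assumptions on my part, not any kind of standard.)

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Fraction of high bytes (0x80-0xFF) that are "expected" accented
# letters, interpreting the bytes as Latin-1/CP-1252.
sub latin1_plausibility {
    my ($bytes) = @_;
    my %count;
    $count{$1}++ while $bytes =~ /([\x80-\xFF])/g;
    my $total = 0;
    $total += $_ for values %count;
    return 1 unless $total;        # pure ASCII: nothing to object to
    my $common = 0;
    $common += $count{$_} // 0
        for map { chr } 0xE9, 0xE8, 0xEA, 0xE0, 0xE7,  # é è ê à ç
                        0xE4, 0xF6, 0xFC, 0xDF;        # ä ö ü ß
    return $common / $total;
}
```

A score near 1 suggests ordinary Western European text; a score near 0 (lots of rare bytes like "þ") suggests either a different language or a different encoding altogether.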

I'm guessing there will also be modules to help you, like Encode::Guess, but I've never used it. I haven't had the need for it so far, but it might be better than trying to come up with something elaborate yourself. On the other hand, this particular module is focused on Far Eastern encodings (Japanese and Chinese, among others), so it might not be the best fit for your purpose.
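For what it's worth, a basic Encode::Guess call looks like this (I haven't used it in anger, so treat this as a sketch from the module's documentation):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode::Guess;   # ships with the core Encode distribution

my $data = "caf\xC3\xA9";   # raw bytes, e.g. read from a file

# With the default suspects (ascii, utf8, and BOM-ed UTF-16/32),
# guess() returns an Encode::Encoding object on success, or a
# diagnostic string when it can't decide.
my $enc = Encode::Guess->guess($data);
if (ref $enc) {
    printf "looks like %s\n", $enc->name;
    my $text = $enc->decode($data);   # now a character string
}
else {
    warn "could not guess: $enc\n";
}
```

Note the caveat that matches the criticism below: if you add cp1252 as a suspect (Encode::Guess->add_suspects('cp1252')), almost any byte string is valid CP-1252, so guess() will often report an ambiguity like "utf8 or cp1252" instead of deciding.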

Re^2: What's the best way to detect character encodings, Windows-1252 v. UTF-8?
by Jim (Curate) on Jun 17, 2011 at 15:40 UTC

    Thank you very much, Bart.

    As I wrote in my inquiry, "I know each file is in one of exactly two different character encodings: Windows-1252 or UTF-8." So I don't have to worry about the various ISO-8859 character sets.

    As I mentioned, "I considered using Encode::Guess, but rejected it because it seems hinky." I read criticism of it that suggested it's no good at doing precisely what I need to do: simply to distinguish between Windows-1252 and UTF-8 character encodings in text that is predominantly in the Latin script—mostly in English with incidental text in other Western European languages.

    Jim

      Well then here's how I'd do it. I'd check the whole file for UTF-8 sequences and any other bytes with value 128 or above.
      • If you find no bytes with values 128-255, then the file is plain ASCII (or CP-1252, or UTF-8; they're all the same here).
      • If you find only valid UTF-8 byte sequences, then it's probably UTF-8. (If the first such sequence sits at the very start of the file and decodes to a BOM character, U+FEFF, then there is very little doubt about it.)
      • If you find only other upper-half bytes, then it's CP-1252.
      • If you find both, it's more likely to be CP-1252, but you'd better take a look at it: it could be a corrupt UTF-8 file.
      Code to test this, assuming $_ contains the whole file as raw bytes (not decoded to characters):

          my (%utf8, %single);
          while (/([\xC0-\xDF][\x80-\xBF]
                  |[\xE0-\xEF][\x80-\xBF][\x80-\xBF]
                  |[\xF0-\xF7][\x80-\xBF][\x80-\xBF][\x80-\xBF])
                 |([\x80-\xFF])/gx) {
              if ($1) {
                  $utf8{$1}++;
              }
              elsif ($2) {
                  $single{$2}++;
              }
          }

      (untested)

      If, after this code block, %single is empty and %utf8 is not, then it's UTF-8; if %single is not empty, then it's CP-1252, with high certainty if %utf8 is empty too. You can do simpler tests than this one, ones that don't involve hashes, but this way it's easier to debug and verify why it decided one way and not another.
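      Putting the whole decision into one routine, here's a minimal sketch (guess_encoding is a name I'm making up, and the verdicts are just the empty/non-empty tests described above):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Classify raw file contents as ascii, utf8 or cp1252 by scanning for
# UTF-8 multi-byte sequences versus stray upper-half bytes.
sub guess_encoding {
    my ($bytes) = @_;
    my (%utf8, %single);
    while ($bytes =~ /([\xC0-\xDF][\x80-\xBF]
                      |[\xE0-\xEF][\x80-\xBF]{2}
                      |[\xF0-\xF7][\x80-\xBF]{3})
                     |([\x80-\xFF])/gx) {
        if    (defined $1) { $utf8{$1}++   }
        elsif (defined $2) { $single{$2}++ }
    }
    return 'ascii'  if !%utf8 && !%single;   # no evidence either way
    return 'utf8'   if  %utf8 && !%single;   # only valid sequences
    return 'cp1252' if !%utf8;               # only stray high bytes
    return 'cp1252 (or corrupt utf8?)';      # mixed: eyeball the file
}
```

For example, guess_encoding("caf\xC3\xA9") returns 'utf8', while guess_encoding("caf\xE9 cr\xE8me") returns 'cp1252'.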