Re^2: What's the best way to detect character encodings, Windows-1252 v. UTF-8?

Thank you very much, Bart.

As I wrote in my inquiry, "I know each file is in one of exactly two different character encodings: Windows-1252 or UTF-8." So I don't have to worry about the various ISO-8859 character sets.

As I mentioned, "I considered using Encode::Guess, but rejected it because it seems hinky." I read criticism of it that suggested it's no good at doing precisely what I need to do: simply to distinguish between Windows-1252 and UTF-8 character encodings in text that is predominantly in the Latin script—mostly in English with incidental text in other Western European languages.

Jim

Replies are listed 'Best First'.
Re^3: What's the best way to detect character encodings, Windows-1252 v. UTF-8? by bart (Canon) on Jun 23, 2011 at 11:37 UTC
Well then, here's how I'd do it. I'd check the whole file for UTF-8 sequences and for any other bytes with value 128 or above.

If you find no bytes with value 128-255, then the file is ASCII (or CP-1252 or UTF-8; they're all the same here).
If you find only valid UTF-8 byte sequences, then it's probably UTF-8. (If the first sequence is at the start of the file and it's a BOM, the character U+FEFF, which UTF-8 encodes as the bytes EF BB BF, then there is very little doubt about it.)
If you find only other upper-half bytes, then it's CP-1252.
If you find both, it's more likely that it's CP-1252, but you'd better take a look at it; it could be a corrupt UTF-8 file.

Code to test this, assuming $_ contains the whole file as raw bytes and has not been decoded as UTF-8 (untested):

    my (%utf8, %single);
    # First alternative: bytes shaped like a 2-, 3-, or 4-byte UTF-8 sequence.
    # Second alternative: any other byte with the high bit set.
    while (/([\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF][\x80-\xBF]|[\xF0-\xF7][\x80-\xBF][\x80-\xBF][\x80-\xBF])|([\x80-\xFF])/g) {
        if (defined $1) {
            $utf8{$1}++;      # count each distinct multi-byte sequence
        } elsif (defined $2) {
            $single{$2}++;    # count each stray high byte
        }
    }

If after this code block %single is empty and %utf8 is not empty, then it's UTF-8. If %single is not empty, then it's CP-1252, with high certainty if %utf8 is empty.

You can do simpler tests than this one that don't involve hashes, but this way it's easier to debug and verify why it decided one way and not the other.
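One of the "simpler tests" mentioned above might look like the following. This is only a sketch, not bart's code: the helper name classify_encoding is made up for illustration, and it assumes the file has been read as raw bytes (for example through a ':raw' filehandle). Instead of counting sequences in hashes, it asks the core Encode module to decode the bytes strictly as UTF-8 and treats any failure as a sign of CP-1252.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not from the thread): classify a string of raw bytes
# that is known to be either Windows-1252 or UTF-8.
sub classify_encoding {
    my ($bytes) = @_;

    # No byte at or above 0x80: plain ASCII, identical in both encodings.
    return 'ascii' unless $bytes =~ /[\x80-\xFF]/;

    # Strict decode. FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC stops it from modifying $bytes in place.
    my $ok = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };

    # Any stray high byte makes the strict decode fail, so a file that
    # mixes valid sequences with stray bytes is reported as cp1252.
    return $ok ? 'utf-8' : 'cp1252';
}
```

Calling classify_encoding($_) on the slurped file returns 'ascii', 'utf-8', or 'cp1252'. Unlike the hash version, it gives no record of which bytes tipped the decision, so a file reported as 'cp1252' may still deserve a look if it could be corrupt UTF-8.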
