Re^3: What's the best way to detect character encodings, Windows-1252 v. UTF-8?

Well then here's how I'd do it. I'd check the whole file for UTF-8 sequences and any other bytes with value 128 or above.

If you find no bytes with value 128-255, then the file is ASCII (or CP-1252 or UTF-8; they're all the same here).
If you only find valid UTF-8 byte sequences, then it's probably UTF-8. (If the first sequence is at the start of the file and it's a BOM character, U+FEFF, encoded in UTF-8 as the bytes EF BB BF, then there is very little doubt about it.)
If you only find other upper-half bytes, then it's CP-1252.
If you find both, it's more likely CP-1252, but you'd better take a look at it; it could be a corrupt UTF-8 file.

Code to test this, assuming $_ contains the whole file as raw bytes (that is, it has not been decoded into Perl's internal UTF-8 representation):

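my (%utf8, %single);
# $1 catches well-formed multi-byte UTF-8 sequences, $2 any other high byte.
while (/( [\xC0-\xDF][\x80-\xBF]
        | [\xE0-\xEF][\x80-\xBF][\x80-\xBF]
        | [\xF0-\xF7][\x80-\xBF][\x80-\xBF][\x80-\xBF] )
       | ([\x80-\xFF])/gx) {
    if (defined $1) {
        $utf8{$1}++;
    } elsif (defined $2) {
        $single{$2}++;    # was $single{$1}, which is undef in this branch
    }
}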

(untested)

If after this code block %single is empty and %utf8 is not empty, then it's UTF-8. If %single is not empty, then it's CP-1252, with high certainty when %utf8 is also empty. You can do simpler tests than this one that don't involve hashes, but this way it's easier to debug and verify why it decided one way and not the other.
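For comparison, here is a minimal sketch of what one of those simpler tests might look like. It is not the method above: instead of collecting sequences in hashes, it asks the core Encode module to do a strict UTF-8 decode and treats any failure as "not UTF-8". The names guess_encoding and has_utf8_bom are hypothetical helpers made up for this illustration, and the input is assumed to be a raw byte string.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if the byte string starts with the UTF-8 BOM
# (U+FEFF encoded as EF BB BF).
sub has_utf8_bom {
    my ($bytes) = @_;
    return $bytes =~ /\A\xEF\xBB\xBF/ ? 1 : 0;
}

# Hypothetical helper: returns 'ascii', 'utf8' or 'cp1252' for a raw byte string.
sub guess_encoding {
    my ($bytes) = @_;

    # No bytes in the range 128-255: plain ASCII, valid in all three encodings.
    return 'ascii' unless $bytes =~ /[\x80-\xFF]/;

    # Strict decode: FB_CROAK makes decode() die on the first malformed
    # sequence, LEAVE_SRC keeps decode() from modifying $bytes.
    my $ok = eval {
        Encode::decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };

    # Anything with high bytes that is not well-formed UTF-8 is assumed
    # to be CP-1252.
    return $ok ? 'utf8' : 'cp1252';
}
```

Note the trade-off: this version lumps the "both kinds found" case in with CP-1252 and cannot tell you which bytes caused the decision, so the hash-based loop is still the better choice when you want to inspect a suspicious file.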
