in reply to Re: What's the best way to detect character encodings, Windows-1252 v. UTF-8?
in thread What's the best way to detect character encodings, Windows-1252 v. UTF-8?

Thank you very much, Bart.

As I wrote in my inquiry, "I know each file is in one of exactly two different character encodings: Windows-1252 or UTF-8." So I don't have to worry about the various ISO-8859 character sets.

As I mentioned, "I considered using Encode::Guess, but rejected it because it seems hinky." I read criticism of it that suggested it's no good at doing precisely what I need to do: simply to distinguish between Windows-1252 and UTF-8 character encodings in text that is predominantly in the Latin script—mostly in English with incidental text in other Western European languages.

Jim

Re^3: What's the best way to detect character encodings, Windows-1252 v. UTF-8?
by bart (Canon) on Jun 23, 2011 at 11:37 UTC
    Well then here's how I'd do it. I'd check the whole file for UTF-8 sequences and any other bytes with value 128 or above.
    • If you find no bytes with value 128-255, then the file is ASCII (or CP-1252 or UTF-8, they're all the same here.)
    • If you find only valid UTF-8 byte sequences, then it's probably UTF-8. (If the file starts with a BOM, the character U+FEFF encoded as the bytes EF BB BF, then there is very little doubt about it.)
    • If you only find other upper half bytes then it's CP-1252.
    • If you find both, it's more likely CP-1252, but you'd better take a look at it; it could be a corrupt UTF-8 file.
    Code to test this, assuming $_ contains the whole file as raw bytes (read without any decoding layer):
    my (%utf8, %single);
    while (/([\xC0-\xDF][\x80-\xBF]
            |[\xE0-\xEF][\x80-\xBF]{2}
            |[\xF0-\xF7][\x80-\xBF]{3})
           |([\x80-\xFF])/gx) {
        if (defined $1) { $utf8{$1}++ }      # well-formed multi-byte sequence
        else            { $single{$2}++ }    # stray byte in the range 128-255
    }
    (untested)
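
The snippet above assumes the file is already in $_ as raw bytes. A minimal sketch of one way to get it there follows; the helper name slurp_raw is hypothetical, and it assumes the file fits comfortably in memory. The ':raw' layer keeps Perl from decoding the bytes or translating line endings.

```perl
use strict;
use warnings;

# Hypothetical helper: return the entire file as an undecoded byte string.
sub slurp_raw {
    my ($path) = @_;
    open my $fh, '<:raw', $path
        or die "Cannot open $path: $!";
    local $/;               # undef the record separator: read everything at once
    my $bytes = <$fh>;
    close $fh;
    return $bytes;
}
```

With this in place, `local $_ = slurp_raw($file);` would set up the input for the matching loop.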

    If, after this code block, %single is empty and %utf8 is not, then it's UTF-8; if %single is not empty and %utf8 is empty, then it's CP-1252 with high certainty. You can do simpler tests than this one that don't involve hashes, but this way it's easier to debug and verify why it decided one way and not the other.
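
For reference, here is a sketch of one of those simpler tests, using plain counters instead of hashes and returning a verdict for the four cases listed earlier. The function name guess_encoding and the labels it returns are made up for illustration, and like the pattern above it does not reject overlong or otherwise ill-formed sequences, so treat it as a rough heuristic rather than a validator.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a string of raw (undecoded) bytes.
# Returns 'ASCII', 'UTF-8', 'CP-1252', or 'mixed'.
sub guess_encoding {
    my ($bytes) = @_;
    my ($utf8, $single) = (0, 0);
    while ($bytes =~ /([\xC0-\xDF][\x80-\xBF]
                      |[\xE0-\xEF][\x80-\xBF]{2}
                      |[\xF0-\xF7][\x80-\xBF]{3})
                     |([\x80-\xFF])/gx) {
        if (defined $1) { $utf8++ }     # looks like a UTF-8 sequence
        else            { $single++ }   # lone high byte
    }
    return 'ASCII'   if !$utf8 && !$single;   # no bytes 128-255 at all
    return 'UTF-8'   if !$single;             # only UTF-8 sequences
    return 'CP-1252' if !$utf8;               # only lone high bytes
    return 'mixed';   # probably CP-1252, but worth inspecting by hand
}
```

For example, guess_encoding("caf\xC3\xA9") gives 'UTF-8' and guess_encoding("caf\xE9") gives 'CP-1252'.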