in reply to Re^2: What's the best way to detect character encodings, Windows-1252 v. UTF-8?

Well then here's how I'd do it. I'd scan the whole file for UTF-8 multi-byte sequences and for any other bytes with value 128 or above. Code to test this, assuming $_ contains the whole file as raw bytes (not decoded as UTF-8):
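my (%utf8, %single);
while (/([\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF][\x80-\xBF]|[\xF0-\xF7][\x80-\xBF][\x80-\xBF][\x80-\xBF])|([\x80-\xFF])/g) {
    if (defined $1) {
        $utf8{$1}++;      # well-formed multi-byte UTF-8 sequence
    }
    elsif (defined $2) {
        $single{$2}++;    # stray high byte, not part of a UTF-8 sequence
    }
}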
(untested)

If, after this code block, %single is empty and %utf8 is not, the file is UTF-8. If %single is not empty, it's CP-1252, with high certainty when %utf8 is also empty. (If both are empty, the file is plain ASCII, which reads the same in either encoding.) You can do simpler tests than this one that don't involve hashes, but this way it's easier to debug and verify why it decided one way and not the other.
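
For comparison, one of those simpler, hash-free tests might look like the sketch below. It swaps the regex for a strict decode via the core Encode module, so it is a different technique from the code above, not a rewrite of it. The name guess_encoding is made up for illustration, and the 'cp1252' answer really only means "not valid UTF-8", on the assumption that those are the only two candidates.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not from the post above): classify a string of raw bytes.
sub guess_encoding {
    my ($bytes) = @_;

    # No byte >= 0x80 at all: pure ASCII, identical in both encodings.
    return 'ascii' unless $bytes =~ /[\x80-\xFF]/;

    # Strict decode: FB_CROAK dies on malformed input,
    # LEAVE_SRC stops decode() from modifying $bytes.
    my $ok = eval {
        Encode::decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };

    # Assumption: anything that is not valid UTF-8 is treated as CP-1252.
    return $ok ? 'utf-8' : 'cp1252';
}

print guess_encoding("caf\xC3\xA9"), "\n";
print guess_encoding("caf\xE9"), "\n";
```

This tells you the verdict but not why, which is exactly what the %utf8 and %single hashes are for.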
