Re: What's the best way to detect character encodings, Windows-1252 v. UTF-8?
by bart (Canon) on Jun 17, 2011 at 11:35 UTC
There are byte sequences that are typical for UTF-8. The first byte of a multi-byte UTF-8 character must be in the range 0xC0-0xF7 (0xC0-0xDF for 2-byte, 0xE0-0xEF for 3-byte, and 0xF0-0xF7 for 4-byte sequences), and all following bytes must be in the range 0x80-0xBF. So if you see an accented character that is not part of such a sequence, you simply know it's not UTF-8. You might guess it's probably ISO Latin-1 (= ISO-8859-1) or Microsoft's extension of it, the Windows character set AKA CP-1252; but that's not necessarily the case. It could be DOS text, for example... or ISO-8859-15.
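For instance, you can turn those byte ranges directly into a validity check (a minimal, untested sketch; $bytes is assumed to hold the raw, undecoded file contents):
# Does the buffer consist only of ASCII bytes and
# well-formed multi-byte UTF-8 sequences?
my $looks_like_utf8 = $bytes =~ /\A(?:
      [\x00-\x7F]                  # single byte (ASCII)
    | [\xC0-\xDF][\x80-\xBF]       # 2-byte sequence
    | [\xE0-\xEF][\x80-\xBF]{2}    # 3-byte sequence
    | [\xF0-\xF7][\x80-\xBF]{3}    # 4-byte sequence
)*\z/x;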
You could use heuristic/statistical methods and simply base a guess of what kind of encoding it is on the frequency of occurrence of bytes (the repertoire): for example, in a French text you'll find lots of "é", "è", "ê", "à" and "ç", but something like "þ" will be extremely rare.
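A crude way to look at that repertoire (again an untested sketch; $bytes holds the raw file contents):
my %freq;
$freq{ ord $1 }++ while $bytes =~ /([\x80-\xFF])/g;

# Print the high bytes seen, most frequent first
for my $byte (sort { $freq{$b} <=> $freq{$a} } keys %freq) {
    printf "0x%02X occurs %d times\n", $byte, $freq{$byte};
}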
I'm guessing there will also be modules to help you, like Encode::Guess, but I've never used it. I haven't had the need for it, thus far, but it might be better than trying to come up with something elaborate yourself. On the other hand, this particular module is focused on Far Eastern encodings (for Japanese and Chinese, among others) so it might not be the best fit for your purpose.
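From its documentation, its basic use appears to be something like this (an untested sketch; note that adding overlapping 8-bit suspects such as cp1252 can easily make the guess ambiguous):
use Encode::Guess;

# guess_encoding() returns an Encode object on success,
# or an error-message string when it can't decide
my $enc = guess_encoding($bytes, 'cp1252');
if (ref $enc) {
    my $text = $enc->decode($bytes);
    print "Guessed ", $enc->name, "\n";
}
else {
    warn "No guess: $enc\n";
}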
Thank you very much, Bart.
As I wrote in my inquiry, "I know each file is in one of exactly two different character encodings: Windows-1252 or UTF-8." So I don't have to worry about the various ISO-8859 character sets.
As I mentioned, "I considered using Encode::Guess, but rejected it because it seems hinky." I read criticism of it that suggested it's no good at doing precisely what I need to do: simply to distinguish between Windows-1252 and UTF-8 character encodings in text that is predominantly in the Latin script—mostly in English with incidental text in other Western European languages.
Jim
Well then here's how I'd do it. I'd check the whole file for UTF-8 sequences and any other bytes with value 128 or above.
- If you find no bytes with value 128-255, then the file is ASCII (or CP-1252 or UTF-8; they're all the same here).
- If you only find valid UTF-8 byte sequences, then it's probably UTF-8. (If the first sequence is at the start of the file and it's a BOM character, value 0xFEFF, then there is very little doubt about it.)
- If you only find other upper-half bytes, then it's CP-1252.
- If you find both, it's more likely that it's CP-1252, but you'd better take a look at it; it could be a corrupt UTF-8 file.
Code to test this, assuming $_ contains the whole file, and is not converted to utf-8:
my(%utf8, %single);
while (/([\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3})|([\x80-\xFF])/g) {
    if ($1) {
        $utf8{$1}++;      # well-formed multi-byte UTF-8 sequence
    } elsif ($2) {
        $single{$2}++;    # lone high byte outside any UTF-8 sequence
    }
}
(untested)
If, after this code block, %single is empty and %utf8 is not, then it's UTF-8; if %single is not empty, then it's CP-1252, with high certainty if %utf8 is empty.
You can do simpler tests than this one that don't involve hashes, but this way it's easier to debug and verify why it decided one way and not another.
Re: What's the best way to detect character encodings, Windows-1252 v. UTF-8?
by moritz (Cardinal) on Jun 17, 2011 at 10:13 UTC
You can just try to decode it as UTF-8, and fall back to cp-1252 if that fails. See Encode, section "Handling Malformed Data".
Re: What's the best way to detect character encodings, Windows-1252 v. UTF-8?
by ikegami (Patriarch) on Jun 17, 2011 at 14:35 UTC
I agree with moritz. Due to some properties of UTF-8, it's very unlikely that cp1252-encoded text would be valid UTF-8*.
use Encode qw( decode );

my $bytes = '...';

my $txt;
if (!eval {
    $txt = decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    1  # No exception
}) {
    $txt = decode('Windows-1252', $bytes);
}
* — Unless the encoded text contains no bytes above 0x7F, in which case it doesn't matter if you treat it as Windows-1252 or UTF-8.
That code would only guess wrong if all of the following are true:
- The text is encoded using Windows-1252 (or iso-8859-1),
- At least one of [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷] is present,
- All instances of [ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞß] are always followed by exactly one of [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
- All instances of [àáâãäåæçèéêëìíîï] are always followed by exactly two of [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
- All instances of [ðñòóôõö÷] are always followed by exactly three of [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
- None of [øùúûüýþÿ] are present, and
- None of [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿] are present except where previously mentioned.
In other words, that code is very reliable.
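A quick demonstration of that reliability (a sketch; the sample string is just arbitrary cp1252 text):
use strict;
use warnings;
use Encode qw( decode );

my $cp1252 = "na\xEFve caf\xE9";   # "naïve café" encoded as cp1252
my $ok = eval {
    decode('UTF-8', $cp1252, Encode::FB_CROAK | Encode::LEAVE_SRC);
    1;
};
print $ok ? "valid UTF-8\n" : "not valid UTF-8\n";   # prints "not valid UTF-8"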
my $bytes = '...';
How do I ensure that $bytes are bytes, not characters? I'm on Microsoft Windows, and the text files are in the DOS format (i.e., CR-LF newlines). In other words, what I/O layer must I use? '<:raw'?
Jim
Both
open(my $fh, '<:raw:perlio', $qfn)
and
open(my $fh, '<', $qfn);
binmode($fh);
would do, but then you'd have to do the CRLF translation yourself.
open(my $fh, '<', $qfn)
will actually work and properly do the CRLF translation (unless you set some default layers somewhere) despite decoding and CRLF translation being done in the wrong order. Note that
open(my $fh, '<:encoding(UTF-8)', $qfn)
also decodes and does CRLF translation in the wrong order. That's why
open(my $fh, '<:encoding(UTF-16le)', $qfn)
doesn't work on Windows (of all places!).
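So in practice: read the bytes raw, decode, then handle CRLF yourself. A minimal sketch (assuming $qfn holds the file name):
use Encode qw( decode );

open(my $fh, '<:raw', $qfn)
    or die("Can't open \"$qfn\": $!\n");
my $bytes = do { local $/; <$fh> };   # slurp the raw octets

my $txt = eval { decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC) }
    // decode('Windows-1252', $bytes);

$txt =~ s/\x0D\x0A/\n/g;   # translate CRLF to LF *after* decoding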
Thank you very much, ikegami.
"Unless it's valid US-ASCII, in which case it doesn't matter if you use Windows-1252 or UTF-8."
Yep. Any purely ASCII text files will simply get a UTF-8 byte order mark prefixed to them, forcing them into Unicode goodness.
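Something like this, I expect (an untested sketch; $qfn and $txt here are hypothetical placeholders for the file name and its decoded contents):
open(my $out, '>:encoding(UTF-8)', $qfn)
    or die("Can't write \"$qfn\": $!\n");
print $out "\x{FEFF}", $txt;   # U+FEFF is written as the 3-byte UTF-8 BOM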
EBCDIC text files will be blown to smithereens. In the context of what I'm doing, I don't care.
Jim
Re: What's the best way to detect character encodings, Windows-1252 v. UTF-8?
by Khen1950fx (Canon) on Jun 17, 2011 at 17:20 UTC
#!/usr/bin/perl
use strict;
use warnings;
use Search::Tools::UTF8;
use String::UTF8 qw(:all);

my $text = 'There are those of you out there stuck with Latin-1.';
print is_utf8($text), "\n",           # well-formed? (String::UTF8)
      is_valid_utf8($text), "\n",     # valid UTF-8? (Search::Tools::UTF8)
      is_ascii($text), "\n",
      looks_like_cp1252($text), "\n";
It outputs:
1
1
1
0
It's well-formed, valid utf8. It's also ascii but not cp1252. The well-formed test comes from String::UTF8, while the other methods come from Search::Tools::UTF8. Does this help?
#!/usr/bin/perl
use strict;
use warnings;
use feature qw( say );
use Search::Tools::UTF8 qw( looks_like_cp1252 );

my $text = "\xC9ric";
say looks_like_cp1252($text) ? 1 : 0;   # 0
Therefore, you appear to be recommending the use of
my $txt;
if (is_valid_utf8($bytes)) {
    $txt = decode('UTF-8', $bytes);
} else {
    $txt = decode('Windows-1252', $bytes);
}
But that requires parsing UTF-8 strings twice for nothing. That is why I didn't mention this possibility when I posted a solution that only parses UTF-8 strings once.
my $bytes = '...';

my $txt;
if (!eval {
    $txt = decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    1  # No exception
}) {
    $txt = decode('Windows-1252', $bytes);
}
Re: What's the best way to detect character encodings, Windows-1252 v. UTF-8?
by grantm (Parson) on Jun 18, 2011 at 00:44 UTC
You might want to look at Encoding-FixLatin - I created it for a very similar situation. In my case I had a Postgres database from an application that had treated text as 8-bit binary strings. Each record was one of: ASCII, UTF-8, ISO-8859-1 or CP1252, but the DB dump as a whole was a mixture of all these. The documentation for Encoding::FixLatin describes the heuristics it uses.
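A minimal usage sketch (the mixed string below is just a made-up illustration):
use Encoding::FixLatin qw( fix_latin );

# One buffer mixing UTF-8 and cp1252/latin-1 bytes, like the DB dump above
my $mixed = "caf\xC3\xA9 / caf\xE9";   # UTF-8 "café", then cp1252 "café"
my $chars = fix_latin($mixed);         # both spellings come out as "café"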
I tried your module using ikegami's cp1252 string. It works for me:
#!/usr/bin/perl
use Modern::Perl;
use Search::Tools::UTF8;
use Encoding::FixLatin qw(fix_latin);
use Encode::Locale;
use Encode;

if ( -t ) {
    binmode(STDIN,  ":encoding(console_in)");
    binmode(STDOUT, ":encoding(console_out)");
    binmode(STDERR, ":encoding(console_out)");
}

my $text = "\xC9ric";
if (is_latin1($text)) {
    say "$text is latin1";
}
else {
    exit;
}

my $fix = fix_latin($text, ascii_hex => 0);
if (!looks_like_cp1252($fix)) {
    say "$fix cannot be mapped to utf8 :-)";
}
else {
    exit;
}

say is_flagged_utf8($fix);
say is_sane_utf8($fix);
say is_valid_utf8($fix);