Re: What's the best way to detect character encodings, Windows-1252 v. UTF-8?
by bart (Canon) on Jun 17, 2011 at 11:35 UTC
There are byte sequences that are typical for UTF-8. The first byte of a multi-byte UTF-8 character must be in the range 0xC0-0xF7 (0xC0-0xDF for 2-byte, 0xE0-0xEF for 3-byte, and 0xF0-0xF7 for 4-byte sequences), and all following bytes must be in the range 0x80-0xBF. So if you see an accented character that is not part of such a sequence, you simply know it's not UTF-8. You might guess it's probably ISO Latin-1 (= ISO-8859-1) or Microsoft's extension of it, the Windows character set AKA CP-1252; but that's not necessarily the case. It could be DOS text, for example... or ISO-8859-15.
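For instance, you can turn those byte ranges directly into a validity check (a minimal, untested sketch; $bytes is assumed to hold the raw, undecoded file contents):
# Does the buffer consist only of ASCII bytes and
# well-formed multi-byte UTF-8 sequences?
my $looks_like_utf8 = $bytes =~ /\A(?:
      [\x00-\x7F]                  # single byte (ASCII)
    | [\xC0-\xDF][\x80-\xBF]       # 2-byte sequence
    | [\xE0-\xEF][\x80-\xBF]{2}    # 3-byte sequence
    | [\xF0-\xF7][\x80-\xBF]{3}    # 4-byte sequence
)*\z/x;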
You could use heuristic/statistical methods and simply base a guess of what kind of encoding it is on the frequency of occurrence of bytes (the repertoire): for example, in a French text you'll find lots of "é", "è", "ê", "à" and "ç", but something like "þ" will be extremely rare.
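A crude way to look at that repertoire (again an untested sketch; $bytes holds the raw file contents):
my %freq;
$freq{ ord $1 }++ while $bytes =~ /([\x80-\xFF])/g;

# Print the high bytes seen, most frequent first
for my $byte (sort { $freq{$b} <=> $freq{$a} } keys %freq) {
    printf "0x%02X occurs %d times\n", $byte, $freq{$byte};
}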
I'm guessing there will also be modules to help you, like Encode::Guess, but I've never used it. I haven't had the need for it, thus far, but it might be better than trying to come up with something elaborate yourself. On the other hand, this particular module is focused on Far Eastern encodings (for Japanese and Chinese, among others) so it might not be the best fit for your purpose.
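From its documentation, its basic use appears to be something like this (an untested sketch; note that adding overlapping 8-bit suspects such as cp1252 can easily make the guess ambiguous):
use Encode::Guess;

# guess_encoding() returns an Encode object on success,
# or an error-message string when it can't decide
my $enc = guess_encoding($bytes, 'cp1252');
if (ref $enc) {
    my $text = $enc->decode($bytes);
    print "Guessed ", $enc->name, "\n";
}
else {
    warn "No guess: $enc\n";
}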
Thank you very much, Bart.
As I wrote in my inquiry, "I know each file is in one of exactly two different character encodings: Windows-1252 or UTF-8." So I don't have to worry about the various ISO-8859 character sets.
As I mentioned, "I considered using Encode::Guess, but rejected it because it seems hinky." I read criticism of it that suggested it's no good at doing precisely what I need to do: simply to distinguish between Windows-1252 and UTF-8 character encodings in text that is predominantly in the Latin script—mostly in English with incidental text in other Western European languages.
Jim
Well then here's how I'd do it. I'd check the whole file for UTF-8 sequences and any other bytes with value 128 or above.
- If you find no bytes with value 128-255, then the file is ASCII (or CP-1252 or UTF-8; they're all the same here).
- If you only find valid UTF-8 byte sequences, then it's probably UTF-8. (If the first sequence is at the start of the file and it's a BOM character, value 0xFEFF, then there is very little doubt about it.)
- If you only find other upper-half bytes, then it's CP-1252.
- If you find both, it's more likely that it's CP-1252, but you'd better take a look at it; it could be a corrupt UTF-8 file.
Code to test this, assuming $_ contains the whole file, and is not converted to utf-8:
my(%utf8, %single);
while (/([\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3})|([\x80-\xFF])/g) {
    if ($1) {
        $utf8{$1}++;      # well-formed multi-byte UTF-8 sequence
    } elsif ($2) {
        $single{$2}++;    # lone high byte outside any UTF-8 sequence
    }
}
(untested)
If, after this code block, %single is empty and %utf8 is not, then it's UTF-8; if %single is not empty, then it's CP-1252, with high certainty if %utf8 is empty.
You can do simpler tests than this one that don't involve hashes, but this way it's easier to debug and verify why it decided one way and not another.
Re: What's the best way to detect character encodings, Windows-1252 v. UTF-8?
by moritz (Cardinal) on Jun 17, 2011 at 10:13 UTC
You can just try to decode it as UTF-8, and fall back to cp-1252 if that fails. See Encode, section "Handling Malformed Data".
Re: What's the best way to detect character encodings, Windows-1252 v. UTF-8?
by ikegami (Patriarch) on Jun 17, 2011 at 14:35 UTC
I agree with moritz. Due to some properties of UTF-8, it's very unlikely that cp1252-encoded text would be valid UTF-8*.
use Encode qw( decode );

my $bytes = '...';

my $txt;
if (!eval {
    $txt = decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    1  # No exception
}) {
    $txt = decode('Windows-1252', $bytes);
}
* — Unless the encoded text contains no bytes above 0x7F, in which case it doesn't matter if you treat it as Windows-1252 or UTF-8.
That code would only guess wrong if all of the following are true:
- The text is encoded using Windows-1252 (or iso-8859-1),
- At least one of [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷] is present,
- All instances of [ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞß] are always followed by exactly one of [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
- All instances of [àáâãäåæçèéêëìíîï] are always followed by exactly two of [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
- All instances of [ðñòóôõö÷] are always followed by exactly three of [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
- None of [øùúûüýþÿ] are present, and
- None of [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿] are present except where previously mentioned.
In other words, that code is very reliable.
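A quick demonstration of that reliability (a sketch; the sample string is just arbitrary cp1252 text):
use strict;
use warnings;
use Encode qw( decode );

my $cp1252 = "na\xEFve caf\xE9";   # "naïve café" encoded as cp1252
my $ok = eval {
    decode('UTF-8', $cp1252, Encode::FB_CROAK | Encode::LEAVE_SRC);
    1;
};
print $ok ? "valid UTF-8\n" : "not valid UTF-8\n";   # prints "not valid UTF-8"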
my $bytes = '...';
How do I ensure that $bytes are bytes, not characters? I'm on Microsoft Windows, and the text files are in the DOS format (i.e., CR-LF newlines). In other words, what I/O layer must I use? '<:raw'?
Jim
Both
open(my $fh, '<:raw:perlio', $qfn)
and
open(my $fh, '<', $qfn);
binmode($fh);
would do, but then you'd have to do the CRLF translation yourself.
open(my $fh, '<', $qfn)
will actually work and properly do the CRLF translation (unless you set some default layers somewhere) despite decoding and CRLF translation being done in the wrong order. Note that
open(my $fh, '<:encoding(UTF-8)', $qfn)
also decodes and does CRLF translation in the wrong order. That's why
open(my $fh, '<:encoding(UTF-16le)', $qfn)
doesn't work on Windows (of all places!).
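So in practice: read the bytes raw, decode, then handle CRLF yourself. A minimal sketch (assuming $qfn holds the file name):
use Encode qw( decode );

open(my $fh, '<:raw', $qfn)
    or die("Can't open \"$qfn\": $!\n");
my $bytes = do { local $/; <$fh> };   # slurp the raw octets

my $txt = eval { decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC) }
    // decode('Windows-1252', $bytes);

$txt =~ s/\x0D\x0A/\n/g;   # translate CRLF to LF *after* decoding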
Thank you very much, ikegami.
"Unless it's valid US-ASCII, in which case it doesn't matter if you use Windows-1252 or UTF-8."
Yep. Any purely ASCII text files will simply get a UTF-8 byte order mark prefixed to them, forcing them into Unicode goodness.
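Something like this, I expect (an untested sketch; $qfn and $txt here are hypothetical placeholders for the file name and its decoded contents):
open(my $out, '>:encoding(UTF-8)', $qfn)
    or die("Can't write \"$qfn\": $!\n");
print $out "\x{FEFF}", $txt;   # U+FEFF is written as the 3-byte UTF-8 BOM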
EBCDIC text files will be blown to smithereens. In the context of what I'm doing, I don't care.
Jim
Re: What's the best way to detect character encodings, Windows-1252 v. UTF-8?
by Khen1950fx (Canon) on Jun 17, 2011 at 17:20 UTC
#!/usr/bin/perl
use strict;
use warnings;
use Search::Tools::UTF8;
use String::UTF8 qw(:all);

my $text = 'There are those of you out there stuck with Latin-1.';
print is_utf8($text), "\n",           # well-formed? (String::UTF8)
      is_valid_utf8($text), "\n",     # valid UTF-8? (Search::Tools::UTF8)
      is_ascii($text), "\n",
      looks_like_cp1252($text), "\n";
It outputs:
1
1
1
0
It's well-formed, valid utf8. It's also ascii but not cp1252. The well-formed test comes from String::UTF8, while the other methods come from Search::Tools::UTF8. Does this help?
#!/usr/bin/perl
use strict;
use warnings;
use feature qw( say );
use Search::Tools::UTF8 qw( looks_like_cp1252 );

my $text = "\xC9ric";
say looks_like_cp1252($text) ? 1 : 0;   # 0
Therefore, you appear to be recommending the use of
my $txt;
if (is_valid_utf8($bytes)) {
    $txt = decode('UTF-8', $bytes);
} else {
    $txt = decode('Windows-1252', $bytes);
}
But that requires parsing UTF-8 strings twice for nothing. That is why I didn't mention this possibility when I posted a solution that only parses UTF-8 strings once.
my $bytes = '...';

my $txt;
if (!eval {
    $txt = decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    1  # No exception
}) {
    $txt = decode('Windows-1252', $bytes);
}
Re: What's the best way to detect character encodings, Windows-1252 v. UTF-8?
by grantm (Parson) on Jun 18, 2011 at 00:44 UTC
You might want to look at Encoding-FixLatin - I created it for a very similar situation. In my case I had a Postgres database from an application that had treated text as 8-bit binary strings. Each record was one of: ASCII, UTF-8, ISO-8859-1 or CP1252, but the DB dump as a whole was a mixture of all these. The documentation for Encoding::FixLatin describes the heuristics it uses.
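A minimal usage sketch (the mixed string below is just a made-up illustration):
use Encoding::FixLatin qw( fix_latin );

# One buffer mixing UTF-8 and cp1252/latin-1 bytes, like the DB dump above
my $mixed = "caf\xC3\xA9 / caf\xE9";   # UTF-8 "café", then cp1252 "café"
my $chars = fix_latin($mixed);         # both spellings come out as "café"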
I tried your module using ikegami's cp1252 string. It works for me:
#!/usr/bin/perl
use Modern::Perl;
use Search::Tools::UTF8;
use Encoding::FixLatin qw(fix_latin);
use Encode::Locale;
use Encode;

if ( -t ) {
    binmode(STDIN,  ":encoding(console_in)");
    binmode(STDOUT, ":encoding(console_out)");
    binmode(STDERR, ":encoding(console_out)");
}

my $text = "\xC9ric";
if (is_latin1($text)) {
    say "$text is latin1";
}
else {
    exit;
}

my $fix = fix_latin($text, ascii_hex => 0);
if (!looks_like_cp1252($fix)) {
    say "$fix cannot be mapped to utf8 :-)";
}
else {
    exit;
}

say is_flagged_utf8($fix);
say is_sane_utf8($fix);
say is_valid_utf8($fix);