Re: What's the best way to detect character encodings, Windows-1252 v. UTF-8?

I ran a short series of checks (well-formedness, validity, whether the text is ASCII, and whether it looks like cp1252) using:

#!/usr/bin/perl

use strict;
use warnings;
use Search::Tools::UTF8;
use String::UTF8 qw(:all);

my $text = 'There are those of you out there stuck with Latin-1.';
print
    is_utf8($text), "\n",            # well-formed? (String::UTF8)
    is_valid_utf8($text), "\n",      # valid UTF-8?
    is_ascii($text), "\n",           # ASCII only?
    looks_like_cp1252($text), "\n";  # looks like cp1252?

It outputs:

1
1
1
0

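The string is well-formed, valid UTF-8. It is also ASCII, and looks_like_cp1252 returns false for it: pure ASCII text has no bytes above 0x7F, so nothing marks it as cp1252 specifically. The well-formedness test comes from String::UTF8, while the other functions come from Search::Tools::UTF8. Does this help?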
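For what it's worth, the test string passes every check because it is pure ASCII, and ASCII bytes mean the same thing in UTF-8 and in cp1252. Here is a minimal sketch of that distinction using only the core Encode module. The helper name classify_bytes is made up for illustration and is not part of String::UTF8 or Search::Tools::UTF8.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper (an assumption, not from either module above):
# classify a byte string as 'ascii', 'utf8', or 'not-utf8'.
sub classify_bytes {
    my ($bytes) = @_;

    # Only bytes 0x00-0x7F: identical in ASCII, UTF-8 and Windows-1252,
    # so the encoding question does not arise.
    return 'ascii' if $bytes !~ /[^\x00-\x7F]/;

    # Strict decode: FB_CROAK dies on malformed input,
    # LEAVE_SRC keeps $bytes unmodified.
    my $ok = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? 'utf8' : 'not-utf8';
}

print classify_bytes('There are those of you out there stuck with Latin-1.'), "\n";  # ascii
print classify_bytes("\xC3\x89ric"), "\n";  # utf8     (E-acute as two UTF-8 bytes)
print classify_bytes("\xC9ric"), "\n";      # not-utf8 (E-acute as one cp1252 byte)
```

Only the last two inputs say anything about the encoding; an ASCII-only sample cannot distinguish UTF-8 from cp1252.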


Re^2: What's the best way to detect character encodings, Windows-1252 v. UTF-8? by ikegami (Patriarch) on Jun 17, 2011 at 18:50 UTC
`looks_like_cp1252` is useless in this context. The following string is unambiguously cp1252 rather than UTF-8, yet `looks_like_cp1252` reports otherwise.

#!/usr/bin/perl
use strict;
use warnings;
use feature qw( say );
use Search::Tools::UTF8 qw( looks_like_cp1252 );

my $text = "\xC9ric";
say looks_like_cp1252($text) ? 1 : 0;  # 0

Therefore, you appear to be recommending the use of

my $txt;
if (is_valid_utf8($bytes)) {
    $txt = decode('UTF-8', $bytes);
} else {
    $txt = decode('Windows-1252', $bytes);
}

But that requires parsing UTF-8 strings twice for nothing. That is why I didn't mention this possibility when I posted a solution that only parses UTF-8 strings once.

use Encode qw( decode );

my $bytes = '...';
my $txt;
if (!eval {
    $txt = decode('UTF-8', $bytes, Encode::FB_CROAK|Encode::LEAVE_SRC);
    1  # No exception
}) {
    $txt = decode('Windows-1252', $bytes);
}
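To see the single-pass fallback in action, it could be wrapped in a subroutine and fed a few sample byte strings. This is only a sketch: the name decode_utf8_or_cp1252 and the sample inputs are assumptions for illustration; the decode calls and flags are the ones shown above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical wrapper name; the body follows the single-pass approach:
# try strict UTF-8 first, fall back to Windows-1252 if that dies.
sub decode_utf8_or_cp1252 {
    my ($bytes) = @_;
    my $txt;
    if (!eval {
        $txt = decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;  # No exception
    }) {
        $txt = decode('Windows-1252', $bytes);
    }
    return $txt;
}

# Sample inputs (assumed for illustration): valid UTF-8, cp1252 with a
# Latin-1-range byte, and cp1252 curly quotes in the 0x80-0x9F range.
for my $bytes ("\xC3\x89ric", "\xC9ric", "\x93quoted\x94") {
    my $txt = decode_utf8_or_cp1252($bytes);
    print encode('UTF-8', $txt), "\n";  # re-encode as UTF-8 for output
}
```

The first two inputs decode to the same text. One caveat: a cp1252 byte sequence that happens to be valid UTF-8 will be decoded as UTF-8, so this is a heuristic, though such sequences are rare in real text.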
Re^2: What's the best way to detect character encodings, Windows-1252 v. UTF-8? by Jim (Curate) on Jun 17, 2011 at 17:38 UTC
It does indeed! Thank you very much, ++Khen1950fx!

Jim