in reply to What's the best way to detect character encodings, Windows-1252 v. UTF-8?

I ran a small series of checks (well-formedness, validity, whether the text is ASCII, and whether it looks like cp1252) using:

String::UTF8
Search::Tools::UTF8

#!/usr/bin/perl
use strict;
use warnings;
use Search::Tools::UTF8;
use String::UTF8 qw(:all);

my $text = 'There are those of you out there stuck with Latin-1.';

print my $str = is_utf8($text), "\n",  # check if well-formed
    is_valid_utf8($text), "\n",
    is_ascii($text), "\n",
    looks_like_cp1252($text), "\n";

It outputs:
1
1
1
0
The string is well-formed, valid UTF-8. It's also ASCII, and looks_like_cp1252 returns false for it. The well-formedness test (is_utf8) comes from String::UTF8, while the other functions come from Search::Tools::UTF8. Does this help?

Re^2: What's the best way to detect character encodings, Windows-1252 v. UTF-8?
by ikegami (Patriarch) on Jun 17, 2011 at 18:50 UTC

    looks_like_cp1252 is useless in this context. The following string is unambiguously cp1252, yet looks_like_cp1252 reports otherwise.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use feature qw( say );
    use Search::Tools::UTF8 qw( looks_like_cp1252 );

    my $text = "\xC9ric";
    say looks_like_cp1252($text) ? 1 : 0;  # 0
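A minimal sketch of why that string can only be cp1252 when the choice is between the two encodings, using only the core Encode module: byte 0xC9 is a UTF-8 lead byte, but the "r" that follows is not a continuation byte, so strict UTF-8 decoding fails, while Windows-1252 maps 0xC9 to U+00C9. The variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw( decode );

my $bytes = "\xC9ric";

# With FB_CROAK, decode() dies on malformed UTF-8 instead of substituting.
# LEAVE_SRC keeps decode() from modifying $bytes.
my $ok = eval {
    decode('UTF-8', $bytes, Encode::FB_CROAK|Encode::LEAVE_SRC);
    1;
};
print $ok ? "valid UTF-8\n" : "not valid UTF-8\n";

# Decoded as Windows-1252, byte 0xC9 becomes the character U+00C9.
my $txt = decode('Windows-1252', $bytes);
printf "first character: U+%04X\n", ord($txt);
```

So it is the UTF-8 validity check, not looks_like_cp1252, that separates the two cases here.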

    Therefore, you appear to be recommending the use of

    my $txt;
    if (is_valid_utf8($bytes)) {
        $txt = decode('UTF-8', $bytes);
    }
    else {
        $txt = decode('Windows-1252', $bytes);
    }

    But that requires parsing UTF-8 strings twice for nothing. That is why I didn't mention this possibility when I posted a solution that only parses UTF-8 strings once.

    use Encode qw( decode );

    my $bytes = '...';
    my $txt;
    if (!eval {
        $txt = decode('UTF-8', $bytes, Encode::FB_CROAK|Encode::LEAVE_SRC);
        1  # No exception
    }) {
        $txt = decode('Windows-1252', $bytes);
    }

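As a hedged illustration, the same decode-once logic can be wrapped in a subroutine and run against both kinds of input. The name decode_utf8_or_cp1252 is hypothetical (it is not part of Encode), and the two sample strings are assumptions chosen for the demonstration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: try strict UTF-8 first, fall back to Windows-1252.
sub decode_utf8_or_cp1252 {
    my ($bytes) = @_;
    my $txt;
    if (!eval {
        $txt = decode('UTF-8', $bytes, Encode::FB_CROAK|Encode::LEAVE_SRC);
        1  # No exception
    }) {
        $txt = decode('Windows-1252', $bytes);
    }
    return $txt;
}

# "Éric" encoded as UTF-8 (C3 89) and as Windows-1252 (C9).
for my $bytes ("\xC3\x89ric", "\xC9ric") {
    my $txt = decode_utf8_or_cp1252($bytes);
    printf "%d characters, first is U+%04X\n", length($txt), ord($txt);
}
```

Both inputs should come back as the same four-character string, and the valid UTF-8 input is parsed only once.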
Re^2: What's the best way to detect character encodings, Windows-1252 v. UTF-8?
by Jim (Curate) on Jun 17, 2011 at 17:38 UTC

    It does indeed! Thank you very much, ++Khen1950fx!

    Jim