looks_like_cp1252 is useless in this context. The following string is unambiguously cp1252, yet looks_like_cp1252 reports otherwise.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use feature qw( say );
    use Search::Tools::UTF8 qw( looks_like_cp1252 );

    my $text = "\xC9ric";
    say looks_like_cp1252($text) ? 1 : 0;  # 0
Therefore, you appear to be recommending the use of
    my $txt;
    if (is_valid_utf8($bytes)) {
        $txt = decode('UTF-8', $bytes);
    } else {
        $txt = decode('Windows-1252', $bytes);
    }
But that requires parsing UTF-8 strings twice for nothing. That is why I didn't mention this possibility when I posted a solution that only parses UTF-8 strings once.
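    use Encode qw( decode );

    my $bytes = '...';
    my $txt;
    if (!eval {
        # FB_CROAK makes decode() die on malformed UTF-8;
        # LEAVE_SRC keeps $bytes intact for the fallback.
        $txt = decode('UTF-8', $bytes, Encode::FB_CROAK|Encode::LEAVE_SRC);
        1  # No exception
    }) {
        $txt = decode('Windows-1252', $bytes);
    }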
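To see the single-pass fallback at work, here is a minimal sketch that wraps it in a subroutine and feeds it the same name in both encodings. The subroutine name decode_utf8_or_cp1252 and the two sample byte strings are my own illustration, not something from Search::Tools or from the earlier posts.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper (the name is an assumption, not an existing API):
# try strict UTF-8 first, fall back to Windows-1252 if decoding croaks.
sub decode_utf8_or_cp1252 {
    my ($bytes) = @_;
    my $txt;
    if (!eval {
        $txt = decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1  # No exception
    }) {
        $txt = decode('Windows-1252', $bytes);
    }
    return $txt;
}

# 0xC9 followed by "r" is malformed UTF-8, so this takes the fallback path.
my $from_cp1252 = decode_utf8_or_cp1252("\xC9ric");

# 0xC3 0x89 is the UTF-8 encoding of U+00C9, so this decodes on the first try.
my $from_utf8 = decode_utf8_or_cp1252("\xC3\x89ric");

print $from_cp1252 eq $from_utf8 ? "same\n" : "different\n";
```

Both calls should yield the same decoded string, and valid UTF-8 input is only ever parsed once.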
In reply to Re^2: What's the best way to detect character encodings, Windows-1252 v. UTF-8? by ikegami, in thread What's the best way to detect character encodings, Windows-1252 v. UTF-8? by Jim