in reply to Re: What's the best way to detect character encodings, Windows-1252 v. UTF-8?
in thread What's the best way to detect character encodings, Windows-1252 v. UTF-8?

looks_like_cp1252 is useless in this context. The following string is unambiguously cp1252, yet looks_like_cp1252 reports otherwise.

#!/usr/bin/perl
use strict;
use warnings;
use feature qw( say );
use Search::Tools::UTF8 qw( looks_like_cp1252 );

my $text = "\xC9ric";
say looks_like_cp1252($text) ? 1 : 0;  # 0
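To see why that string can't be UTF-8: byte 0xC9 would have to start a two-byte sequence, but the next byte, "r" (0x72), isn't a continuation byte, so a strict UTF-8 decode fails. A minimal sketch that checks this with nothing but the core Encode module (the variable names are mine):

```perl
use strict;
use warnings;
use Encode qw( decode );

my $bytes = "\xC9ric";

# Strict UTF-8 decode: FB_CROAK dies on malformed input,
# LEAVE_SRC keeps $bytes from being modified.
my $is_utf8 = eval {
    decode('UTF-8', $bytes, Encode::FB_CROAK|Encode::LEAVE_SRC);
    1;
} ? 1 : 0;

# Every byte of this string is defined in Windows-1252, so this succeeds.
my $txt = decode('Windows-1252', $bytes);

printf "valid UTF-8: %d\n", $is_utf8;               # valid UTF-8: 0
printf "first char as cp1252: U+%04X\n", ord($txt); # first char as cp1252: U+00C9
```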

Therefore, you appear to be recommending the following:

my $txt;
if (is_valid_utf8($bytes)) {
    $txt = decode('UTF-8', $bytes);
} else {
    $txt = decode('Windows-1252', $bytes);
}

But that parses every valid UTF-8 string twice for nothing: once to validate it and once more to decode it. That is why I didn't mention this possibility when I posted a solution that parses UTF-8 strings only once.

my $bytes = '...';

my $txt;
if (!eval {
    $txt = decode('UTF-8', $bytes, Encode::FB_CROAK|Encode::LEAVE_SRC);
    1  # No exception
}) {
    $txt = decode('Windows-1252', $bytes);
}
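For illustration, here is the same single-pass approach wrapped in a sub and run against both encodings of the same name. This is only a sketch; decode_utf8_or_cp1252 is a name I made up, not something Encode provides.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: try a strict UTF-8 decode first and
# fall back to Windows-1252 only if that decode throws.
sub decode_utf8_or_cp1252 {
    my ($bytes) = @_;

    my $txt;
    eval {
        $txt = decode('UTF-8', $bytes, Encode::FB_CROAK|Encode::LEAVE_SRC);
        1;  # No exception
    } or $txt = decode('Windows-1252', $bytes);

    return $txt;
}

my $from_utf8   = decode_utf8_or_cp1252("\xC3\x89ric");  # "Éric" encoded as UTF-8
my $from_cp1252 = decode_utf8_or_cp1252("\xC9ric");      # "Éric" encoded as cp1252

print $from_utf8 eq $from_cp1252 ? "same\n" : "different\n";  # same
```

Valid UTF-8 input is parsed exactly once; only input that fails the strict decode pays for a second pass.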