Re: What's the best way to detect character encodings, Windows-1252 v. UTF-8?

You might want to look at Encoding-FixLatin - I created it for a very similar situation. In my case I had a Postgres database from an application that had treated text as 8-bit binary strings. Each record was one of: ASCII, UTF-8, ISO-8859-1 or CP1252, but the DB dump as a whole was a mixture of all these. The documentation for Encoding::FixLatin describes the heuristics it uses.
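
For a dump where each record is entirely in one encoding, the general idea can be sketched with nothing but the core Encode module. This is only an illustration of "try strict UTF-8, then fall back to CP1252" and is not how Encoding::FixLatin is implemented (its documentation describes its own heuristics). The helper names decode_record and codepoints are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper, not part of Encoding::FixLatin: decode one record of
# raw bytes, preferring strict UTF-8 and falling back to CP1252.
sub decode_record {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK flag may modify its input
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return $text if defined $text;       # valid UTF-8 (plain ASCII included)
    return decode('cp1252', $bytes);     # assumption: anything else is CP1252
}

# Hypothetical helper: show a decoded string as a list of code points,
# so the output stays ASCII whatever the terminal encoding is.
sub codepoints {
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $_[0];
}

my @records = (
    "Eric",            # plain ASCII
    "\xC3\x89ric",     # UTF-8 bytes for "Éric"
    "\xC9ric",         # Latin-1 / CP1252 bytes for "Éric"
    "\x93hi\x94",      # CP1252 curly quotes around "hi"
);

print codepoints(decode_record($_)), "\n" for @records;
```

The fallback order matters: bytes that happen to form valid UTF-8 are always taken as UTF-8, so a CP1252 record that looks like valid UTF-8 would be misread. That is rare in real text, but it is a guess, not a proof.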


Replies are listed 'Best First'.
Re^2: What's the best way to detect character encodings, Windows-1252 v. UTF-8? by Khen1950fx (Canon) on Jun 18, 2011 at 11:37 UTC
I tried your module using ikegami's cp1252. It works for me:

#!/usr/bin/perl

use Modern::Perl;

use Search::Tools::UTF8;
use Encoding::FixLatin qw(fix_latin);
use Encode::Locale;
use Encode;

if ( -t ) {
        binmode(STDIN,  ":encoding(console_in)");
        binmode(STDOUT, ":encoding(console_out)");
        binmode(STDERR, ":encoding(console_out)");
}

my $text = "\xC9ric";
if (is_latin1($text)) {
    say "$text is latin1";
}
else {
    exit;    # 'return' is only valid inside a subroutine
}

my $fix = fix_latin($text, ascii_hex => 0);
if (!looks_like_cp1252($fix)) {
    say "$fix cannot be mapped to utf8:-)";
}
else {
    exit;    # 'return' is only valid inside a subroutine
}
say is_flagged_utf8($fix);
say is_sane_utf8($fix);
say is_valid_utf8($fix);