I was looking for a small set of extended ascii characters to "flatten" (if you will) to an ascii counterpart
Sounds very much like Text::Unidecode!
I do not know the encodings of the text files that I'm reading. They were probably exported as CSV from Excel or created by a Perl script from reading an Open Office .ods file. Tab delimited text files created differently over the course of 20+ years. That would make sense though since it's the older files that have a single byte eacute, then all of the sudden the two-byte eacute is the only variety found.
Yes, that does sound likely. Here's a very simple example of how one might tell the difference between three of the encodings I named. Of course, if you have more encodings than this, things can get more complex, and even if these encodings seem to work, you'll probably need to tweak the heuristics in the example below to fit your actual data.
#!/usr/bin/env perl
use warnings;
use strict;
use open qw/:std :utf8/;
use Text::Unidecode;
use Encode;

print "# Text::Unidecode demo:\n";
my $test = "\N{U+201C}test\N{U+201D} \N{U+2013} test\N{U+2026}";
print " original: ", $test, "\n";
print "asciified: ", unidecode($test), "\n";

# set up some test data
my $str = "\N{U+CF} spent 20\N{U+20AC} \N{U+C3}t the c\N{U+AA}f\N{U+E9}\n";
{
    open my $fh1, '>:raw:encoding(CP-1252)', 'one.txt' or die $!;
    print $fh1 $str;
    close $fh1;
    open my $fh2, '>:raw:encoding(Latin-9)', 'two.txt' or die $!;
    print $fh2 $str;
    close $fh2;
    open my $fh3, '>:raw:encoding(UTF-8)', 'three.txt' or die $!;
    print $fh3 $str;
    close $fh3;
}

my $expected_chars   = qr/[\N{U+20AC}]/u;  # heuristic
my $unexpected_chars = qr/[\N{U+80}]/u;    # heuristic

for my $file (qw/ one.txt two.txt three.txt /) {
    # slurp the raw file as undecoded bytes
    open my $fh, '<:raw', $file or die "$file: $!";
    my $bytes = do { local $/; <$fh> };
    close $fh;
    my $string;
    # try different encodings
    for my $enc (qw/ UTF-8 Latin-9 CP-1252 /) {
        $string = eval { decode($enc, $bytes, Encode::FB_CROAK|Encode::LEAVE_SRC) };
        if ( defined $string && $string =~ $expected_chars && $string !~ $unexpected_chars ) {
            print "### $file looks like $enc\n";
            last
        }
        else {
            print "### $file is NOT $enc\n"
        }
    }
    die "Failed to decode $file" unless defined $string;
    print $string;
    print unidecode($string);
}
Output (on a terminal with UTF-8 encoding):
# Text::Unidecode demo:
 original: “test” – test…
asciified: "test" - test...
### one.txt is NOT UTF-8
### one.txt is NOT Latin-9
### one.txt looks like CP-1252
Ï spent 20€ Ãt the cªfé
I spent 20EUR At the cafe
### two.txt is NOT UTF-8
### two.txt looks like Latin-9
Ï spent 20€ Ãt the cªfé
I spent 20EUR At the cafe
### three.txt looks like UTF-8
Ï spent 20€ Ãt the cªfé
I spent 20EUR At the cafe
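If your files really only contain the two varieties you describe (a single-byte eacute, \351, in the older files and a two-byte eacute, \303\251, in the newer ones), a cruder check might be enough: try a strict UTF-8 decode first and fall back to a single-byte encoding when that fails. The following is only a minimal sketch under that assumption. The helper name decode_guess is made up for illustration, and CP-1252 as the fallback is a guess about your older files, not something I can tell from here.

```perl
use warnings;
use strict;
use Encode qw/decode/;

# Hypothetical helper: strict UTF-8 first, then fall back to CP-1252.
# The choice of CP-1252 is an assumption about the older files.
sub decode_guess {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed UTF-8 instead of
    # silently substituting; LEAVE_SRC keeps $bytes unmodified.
    my $string = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK|Encode::LEAVE_SRC)
    };
    return ($string, 'UTF-8') if defined $string;
    return (decode('CP-1252', $bytes), 'CP-1252');
}

my ($s1, $enc1) = decode_guess("caf\351\n");      # single-byte eacute
my ($s2, $enc2) = decode_guess("caf\303\251\n");  # two-byte eacute

# Report the guessed encoding and the code point of the 4th character.
printf "%s: U+%04X\n", $enc1, ord(substr($s1, 3, 1));
printf "%s: U+%04X\n", $enc2, ord(substr($s2, 3, 1));
```

Both inputs should end up as the same character, U+00E9, which you can then pass to unidecode. Keep in mind that a pure-ASCII file decodes fine as UTF-8 too, and that this fallback cannot tell CP-1252 from Latin-9 or Latin-1; for that you'd still need heuristics like the ones in the longer example.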