Re^2: Two octal values for eacute?

Thank you haukex!

I think I once skimmed the first link you posted, but that was years ago, so I do have some refreshing to do.

You are correct - I do not know the encodings of the text files that I'm reading. They were probably exported as CSV from Excel or created by a Perl script reading an OpenOffice .ods file: tab-delimited text files created in different ways over the course of 20+ years. That would make sense, though, since it's the older files that have a single-byte eacute, and then all of a sudden the two-byte eacute is the only variety found.
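A minimal sketch of why the same character can show up as one byte or two. It assumes the older files are Latin-1 or CP-1252 and the newer ones UTF-8, which is only a guess since the real encodings are unknown. The helper name `octal_bytes` is made up for this illustration; `encode` comes from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode one character and return its bytes
# written as three-digit octal escapes.
sub octal_bytes {
    my ($enc, $char) = @_;
    my $bytes = encode($enc, $char);
    return join '', map { sprintf '\\%03o', ord } split //, $bytes;
}

my $eacute = "\N{U+E9}";    # LATIN SMALL LETTER E WITH ACUTE
for my $enc (qw/ ISO-8859-1 CP-1252 UTF-8 /) {
    printf "%-10s %s\n", $enc, octal_bytes($enc, $eacute);
}
```

Under those assumptions the single-byte encodings give \351 and UTF-8 gives \303\251, which would account for two different octal patterns matching the "same" letter.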

I will read through the links, "best practices" and all. Much appreciated there!!

Oh, and I was looking for a small set of extended ASCII characters to "flatten" (if you will) to an ASCII counterpart, as I could not reliably reproduce them, which again points to the fact that they were probably encoded differently. I used a small subroutine to turn the two differently encoded eacutes into an 'e' to mitigate these headaches. The same sub also translated ellipses to '...', curly left/right double quotes to straight double quotes, long dashes to normal dashes, and so on: all the things that a spreadsheet program automatically substitutes in when you're typing. I didn't think about the encoding so much, but instead found octal regexes that could pluck out each of these characters so that I could insert what I felt was a suitable replacement. Nothing personal against the eacute!
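For illustration, here is a sketch of that kind of "flattening" done on decoded characters rather than on raw bytes, so each character needs only one entry no matter how the file was encoded. The table `%ascii_for` and the sub name `flatten` are hypothetical, and the choice of replacements is an assumption, not the original subroutine.

```perl
use strict;
use warnings;

# Hypothetical replacement table; extend it to match the actual data.
my %ascii_for = (
    "\N{U+E9}"   => 'e',      # e with acute
    "\N{U+2026}" => '...',    # horizontal ellipsis
    "\N{U+201C}" => '"',      # left double quotation mark
    "\N{U+201D}" => '"',      # right double quotation mark
    "\N{U+2013}" => '-',      # en dash
    "\N{U+2014}" => '-',      # em dash
);

# One alternation that matches any character listed in the table.
my $pattern = join '|', map { quotemeta } keys %ascii_for;
my $re      = qr/($pattern)/;

# Expects text that has already been decoded into Perl characters.
sub flatten {
    my ($text) = @_;
    $text =~ s/$re/$ascii_for{$1}/g;
    return $text;
}
```

The catch is that the input must be decoded correctly first; applied to undecoded bytes, the table would never match the two-byte form.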

Thank you so much for your time and expertise!

Replies are listed 'Best First'.

Re^3: Two octal values for eacute?
by haukex (Archbishop) on May 24, 2020 at 14:25 UTC

I was looking for a small set of extended ASCII characters to "flatten" (if you will) to an ASCII counterpart

Sounds very much like Text::Unidecode!

I do not know the encodings of the text files that I'm reading. They were probably exported as CSV from Excel or created by a Perl script reading an OpenOffice .ods file: tab-delimited text files created in different ways over the course of 20+ years. That would make sense, though, since it's the older files that have a single-byte eacute, and then all of a sudden the two-byte eacute is the only variety found.

Yes, that does sound likely. Here's a very simple example of how one might tell the difference between three of the encodings I named. Of course, if you have more encodings than this, things can get more complex, and even if these encodings seem to work, you'll probably need to tweak the heuristics in the example below to fit your actual data.

#!/usr/bin/env perl
use warnings;
use strict;
use open qw/:std :utf8/;
use Text::Unidecode;
use Encode;

print "# Text::Unidecode demo:\n";
my $test = "\N{U+201C}test\N{U+201D} \N{U+2013} test\N{U+2026}";
print " original: ", $test, "\n";
print "asciified: ", unidecode($test), "\n";

# set up some test data
my $str = "\N{U+CF} spent 20\N{U+20AC} \N{U+C3}t the c\N{U+AA}f\N{U+E9}\n";
{
    open my $fh1, '>:raw:encoding(CP-1252)', 'one.txt' or die $!;
    print $fh1 $str;
    close $fh1;
    open my $fh2, '>:raw:encoding(Latin-9)', 'two.txt' or die $!;
    print $fh2 $str;
    close $fh2;
    open my $fh3, '>:raw:encoding(UTF-8)', 'three.txt' or die $!;
    print $fh3 $str;
    close $fh3;
}

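# heuristics: the data should contain a euro sign, and a U+0080 control
# character suggests CP-1252 bytes were misread as Latin-9
my $expected_chars = qr/[\N{U+20AC}]/u;
my $unexpected_chars = qr/[\N{U+80}]/u;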
for my $file (qw/ one.txt two.txt three.txt /) {
    # slurp the raw file as undecoded bytes
    open my $fh, '<:raw', $file or die "$file: $!";
    my $bytes = do { local $/; <$fh> };
    close $fh;
    my $string;
    # try different encodings
    for my $enc (qw/ UTF-8 Latin-9 CP-1252 /) {
        $string = eval { decode($enc, $bytes,
            Encode::FB_CROAK|Encode::LEAVE_SRC) };
        if ( defined $string && $string =~ $expected_chars
            && $string !~ $unexpected_chars )
                { print "### $file looks like $enc\n"; last }
        else { print "### $file is NOT $enc\n" }
    }
    die "Failed to decode $file" unless defined $string;
    print $string;
    print unidecode($string);
}

Output (on a terminal with UTF-8 encoding):

# Text::Unidecode demo:
 original: “test” – test…
asciified: "test" - test...
### one.txt is NOT UTF-8
### one.txt is NOT Latin-9
### one.txt looks like CP-1252
Ï spent 20€ Ãt the cªfé
I spent 20EUR At the cafe
### two.txt is NOT UTF-8
### two.txt looks like Latin-9
Ï spent 20€ Ãt the cªfé
I spent 20EUR At the cafe
### three.txt looks like UTF-8
Ï spent 20€ Ãt the cªfé
I spent 20EUR At the cafe
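
The detection loop above could also be factored into a reusable helper. The following is only a sketch: the name `guess_decode` and its argument order are made up for illustration, and the heuristics are the same data-specific guesses as in the script. It uses only `decode` and the `FB_CROAK` / `LEAVE_SRC` check flags from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns (encoding, decoded string) for the first
# candidate that decodes cleanly and passes both heuristics, or an empty
# list if no candidate fits.
sub guess_decode {
    my ($bytes, $expected, $unexpected, @candidates) = @_;
    for my $enc (@candidates) {
        my $string = eval {
            decode($enc, $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        };
        next unless defined $string;         # bytes are not valid in $enc
        next unless $string =~ $expected;    # heuristic: must contain these
        next if     $string =~ $unexpected;  # heuristic: must not contain these
        return ($enc, $string);
    }
    return;
}

my $bytes = "20\x80";    # "20" followed by the CP-1252 byte for the euro sign
my ($enc, $string) = guess_decode(
    $bytes, qr/\N{U+20AC}/u, qr/\N{U+80}/u, qw/ UTF-8 Latin-9 CP-1252 /,
);
print defined $enc ? "looks like $enc\n" : "no candidate matched\n";
```

Returning an empty list when nothing fits lets the caller tell "decoded but failed the heuristics" apart from a successful guess, which a plain `defined` check on the decoded string cannot do.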
