I was looking for a small set of extended ascii characters to "flatten" (if you will) to an ascii counterpart

Sounds very much like Text::Unidecode!

I do not know the encodings of the text files that I'm reading. They were probably exported as CSV from Excel or created by a Perl script from reading an Open Office .ods file. Tab delimited text files created differently over the course of 20+ years. That would make sense though since it's the older files that have a single byte eacute, then all of the sudden the two-byte eacute is the only variety found.

Yes, that does sound likely. Here's a very simple example of how one might tell the difference between three of the encodings I named. Of course, if you have more encodings than this, things can get more complex, and even if these encodings seem to work, you'll probably need to tweak the heuristics in the example below to fit your actual data.

#!/usr/bin/env perl
use warnings;
use strict;
use open qw/:std :utf8/;
use Text::Unidecode;
use Encode;

print "# Text::Unidecode demo:\n";
my $test = "\N{U+201C}test\N{U+201D} \N{U+2013} test\N{U+2026}";
print " original: ", $test, "\n";
print "asciified: ", unidecode($test), "\n";

# set up some test data
my $str = "\N{U+CF} spent 20\N{U+20AC} \N{U+C3}t the c\N{U+AA}f\N{U+E9}\n";
{
    open my $fh1, '>:raw:encoding(CP-1252)', 'one.txt' or die $!;
    print $fh1 $str;
    close $fh1;
    open my $fh2, '>:raw:encoding(Latin-9)', 'two.txt' or die $!;
    print $fh2 $str;
    close $fh2;
    open my $fh3, '>:raw:encoding(UTF-8)', 'three.txt' or die $!;
    print $fh3 $str;
    close $fh3;
}

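# heuristic: the data is expected to contain a Euro sign
my $expected_chars = qr/[\N{U+20AC}]/u;
# heuristic: U+0080 is a C1 control character, which is what the
# CP-1252 Euro byte (0x80) turns into when misread as Latin-9
my $unexpected_chars = qr/[\N{U+80}]/u;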
for my $file (qw/ one.txt two.txt three.txt /) {
    # slurp the raw file as undecoded bytes
    open my $fh, '<:raw', $file or die "$file: $!";
    my $bytes = do { local $/; <$fh> };
    close $fh;
    my $string;
    # try different encodings
    for my $enc (qw/ UTF-8 Latin-9 CP-1252 /) {
        $string = eval { decode($enc, $bytes,
            Encode::FB_CROAK|Encode::LEAVE_SRC) };
        if ( defined $string && $string =~ $expected_chars
            && $string !~ $unexpected_chars )
                { print "### $file looks like $enc\n"; last }
        else { print "### $file is NOT $enc\n" }
    }
    die "Failed to decode $file" unless defined $string;
    print $string;
    print unidecode($string);
}

Output (on a terminal with UTF-8 encoding):

# Text::Unidecode demo:
 original: “test” – test…
asciified: "test" - test...
### one.txt is NOT UTF-8
### one.txt is NOT Latin-9
### one.txt looks like CP-1252
Ï spent 20€ Ãt the cªfé
I spent 20EUR At the cafe
### two.txt is NOT UTF-8
### two.txt looks like Latin-9
Ï spent 20€ Ãt the cªfé
I spent 20EUR At the cafe
### three.txt looks like UTF-8
Ï spent 20€ Ãt the cªfé
I spent 20EUR At the cafe
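
If your files turn out to contain only the two eacute varieties you describe (the single byte octal 351, and the two bytes octal 303 251), a simpler fallback might be enough: try a strict UTF-8 decode first, and only if that fails treat the bytes as CP-1252. This is just a sketch under that assumption; decode_guess is a name I made up for illustration, and the choice of CP-1252 as the fallback is a guess about your older files. The single-byte encoding has to come last because it will accept pretty much any byte sequence.

```perl
#!/usr/bin/env perl
use warnings;
use strict;
use Encode qw/decode/;

# Hypothetical helper (not from the script above): try strict UTF-8
# first, then fall back to CP-1252 if the bytes are not valid UTF-8.
# Returns the decoded string and the name of the encoding that worked.
sub decode_guess {
    my ($bytes) = @_;
    my $string = eval { decode('UTF-8', $bytes,
        Encode::FB_CROAK|Encode::LEAVE_SRC) };
    return ($string, 'UTF-8') if defined $string;
    # assumption: anything that is not valid UTF-8 is CP-1252
    $string = decode('CP-1252', $bytes,
        Encode::FB_CROAK|Encode::LEAVE_SRC);
    return ($string, 'CP-1252');
}

# the same word, once with a single-byte and once with a two-byte eacute
for my $bytes ("caf\xE9", "caf\xC3\xA9") {
    my ($string, $enc) = decode_guess($bytes);
    # show the raw bytes in octal, the guess, and the decoded length
    printf "%-20s decoded as %-7s -> %d characters\n",
        join(' ', map { sprintf '%03o', ord($_) } split //, $bytes),
        $enc, length $string;
}
```

Both inputs should come out as the same four-character string, which you can then pass to unidecode() as above. Note that this won't catch files that are valid UTF-8 by accident, or files in some other single-byte encoding, so the heuristics from the longer example are still worth keeping if your data is more varied.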

In reply to Re^3: Two octal values for eacute? by haukex
in thread Two octal values for eacute? by pianomonious
