pianomonious has asked for the wisdom of the Perl Monks concerning the following question:

Friends... I'm slowly losing my mind trying to figure out why I must capture an eacute character (as in café) via two octal regex patterns.

In short, I parse old text files and often come across some extended ASCII characters like en-dash, ellipsis, eacute, etc. which were encoded that way by some spreadsheet program like Excel or Open Office Calc.

Here are the two regexes that capture eacute for me:

if ($field =~ /\351/) { ... }
if ($field =~ /\303\251/) { ... }

The first variation (octal 351) agrees with the ASCII table shown here:

https://www.ascii-code.com

My terminal program cannot display this character, and this online octal-to-ascii converter cannot either:

https://onlineasciitools.com/convert-octal-to-ascii

Yet, my Firefox browser is able to render this eacute character properly, when reading it from a text file.

The second variation (octal 303 251) is not mentioned in any ASCII table, but the eacute symbol is rendered correctly by my terminal program and can be properly converted by the octal-to-ascii converter mentioned above. As well, Firefox can render this properly from a text file.

Could someone please shed some light on what is happening?

Thanks in advance, and my apologies if I'm missing something obvious.

Re: Two octal values for eacute?
by haukex (Archbishop) on May 23, 2020 at 21:16 UTC

    Please see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

    The character U+00E9 LATIN SMALL LETTER E WITH ACUTE (é) is encoded in Latin-1, Latin-9, and CP-1252 as the single byte \xE9 (\351), but when encoded with UTF-8, it's the two-byte sequence \xC3\xA9 (\303\251).
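To see the two byte sequences side by side, here is a minimal sketch that encodes the same character with each of the encodings named above and prints the resulting bytes in octal. The helper name octal_bytes is made up for this illustration; only the core Encode module is assumed.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode a character string and return its bytes as backslash-octal
# escapes. (octal_bytes is a hypothetical helper, not a library function.)
sub octal_bytes {
    my ($encoding, $chars) = @_;
    my $bytes = encode($encoding, $chars);
    return join '', map { sprintf '\\%03o', ord($_) } split //, $bytes;
}

my $eacute = "\N{U+00E9}";    # LATIN SMALL LETTER E WITH ACUTE

# Latin-1, Latin-9, CP-1252, and UTF-8, in that order
for my $encoding (qw/ ISO-8859-1 ISO-8859-15 cp1252 UTF-8 /) {
    printf "%-12s %s\n", $encoding, octal_bytes($encoding, $eacute);
}
```

The three single-byte encodings should all report \351, and UTF-8 should report \303\251, which matches the two regexes in the question.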

    In other words, some of your files are encoded with one of the single-byte encodings, others are encoded with UTF-8, and you'll have to specify the correct encoding when opening them, as in e.g. open my $fh, '<:raw:encoding(UTF-8)', $filename or die "$filename $!"; (see "open" Best Practices). That way, when you read the data into Perl, the characters are correctly decoded and you'll always have the correct characters (e.g. "\N{U+00E9}") in your Perl strings.
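As a rough illustration of why decoding on open helps, the sketch below writes one file with the single-byte form and one with the UTF-8 form, then reads each back through the matching :encoding layer. The file names and the helpers write_bytes and read_decoded are invented for this example, and it assumes the older file is CP-1252.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);

# Write raw, undecoded bytes to a file (hypothetical helper).
sub write_bytes {
    my ($path, $bytes) = @_;
    open my $out, '>:raw', $path or die "$path: $!";
    print {$out} $bytes;
    close $out or die "$path: $!";
}

# Slurp a file through an :encoding layer, so the result is a string
# of characters rather than bytes (hypothetical helper).
sub read_decoded {
    my ($path, $encoding) = @_;
    open my $in, "<:raw:encoding($encoding)", $path or die "$path: $!";
    my $text = do { local $/; <$in> };
    close $in;
    return $text;
}

write_bytes("$dir/old.txt", "caf\351");        # single-byte e-acute
write_bytes("$dir/new.txt", "caf\303\251");    # UTF-8 e-acute

my $old = read_decoded("$dir/old.txt", 'cp1252');
my $new = read_decoded("$dir/new.txt", 'UTF-8');

# After decoding, a single pattern covers both files.
for my $text ($old, $new) {
    print "found e-acute\n" if $text =~ /\N{U+00E9}/;
}
```

Once both strings are decoded they compare equal, so the pair of octal regexes collapses into one character-level pattern.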

    If you don't know the encoding of the input files, you could use a module like Encode::Guess, or I've written a tool that tries to be a little smarter: enctool - it allows you to narrow down the guesses by specifying what characters are expected to appear in the input file using e.g. the --one-of='\xE9' option. Some files, like HTML and XML, will often include a definition of the character set in their source, and (except for the cases where that declaration is incorrect) the appropriate parser modules (e.g. XML::LibXML) should honor that encoding.
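A small sketch of what Encode::Guess can and cannot tell you, assuming Latin-1 is the only extra suspect beyond the module's defaults. The helper name describe_guess is made up here; guess_encoding returns an encoding object on success and a plain error string otherwise.

```perl
use strict;
use warnings;
use Encode::Guess;

# Report the guessed encoding name for a byte string, or the reason the
# guess failed. 'iso-8859-1' is added to the default suspects.
# (describe_guess is a hypothetical helper.)
sub describe_guess {
    my ($bytes) = @_;
    my $decoder = guess_encoding($bytes, 'iso-8859-1');
    return ref($decoder) ? $decoder->name : "undecided ($decoder)";
}

print describe_guess("caf\351 au lait"), "\n";        # single-byte input
print describe_guess("caf\303\251 au lait"), "\n";    # UTF-8 input
```

The single-byte input is not valid UTF-8, so only Latin-1 remains as a candidate. The UTF-8 input is also valid Latin-1 (any byte sequence is), so the guess should come back undecided, which is why extra heuristics about expected characters are usually needed for single-byte encodings.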

    As an aside, if you're putting Unicode characters in your Perl source, you should save it as UTF-8 and add use utf8; at the top of the file. If you're writing Unicode characters to the console, add use open qw/:std :utf8/;. And of course always Use strict and warnings, and a recent version of Perl is strongly recommended when working with Unicode.

    If you have further issues with encodings when reading files, please see the tips for posting questions in this node.

    By the way, why are you looking for "é" characters in the first place? Maybe there's a more efficient way to do what you're doing with your regex, if you tell us what the task is.

      Thank you haukex!

      I feel like I once skimmed that first link you posted, but that was years ago, so I have some refreshing to do.

      You are correct - I do not know the encodings of the text files that I'm reading. They were probably exported as CSV from Excel or created by a Perl script from reading an Open Office .ods file. Tab delimited text files created differently over the course of 20+ years. That would make sense though since it's the older files that have a single byte eacute, then all of a sudden the two-byte eacute is the only variety found.

      I will read through the links "best practices" and all. Much appreciated there!!

      Oh, and I was looking for a small set of extended ascii characters to "flatten" (if you will) to an ascii counterpart as I could not reliably reproduce them - again pointing to the fact that they were probably encoded differently. I used a small subroutine to make two differently encoded eacutes into an 'e' to mitigate these headaches. The same sub also translated ellipses to '...', curved left/right double-quotes to straight double-quotes, long dashes to normal dashes and so on. All of these things that a spreadsheet program automatically substitutes in when you're typing. I didn't think of the encoding so much, but instead found octal regexes that could pluck out each of these characters so that I could insert what I felt was a suitable replacement. Nothing personal against the eacute!
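A flattening sub of the kind described above could be sketched roughly as follows, assuming the input has already been decoded into characters. The replacement table and the name flatten are illustrative only, and anything not in the table is left untouched.

```perl
use strict;
use warnings;

# Hypothetical replacement table; extend it to fit the real data.
my %ascii_for = (
    "\N{U+00E9}" => 'e',      # e-acute
    "\N{U+2026}" => '...',    # horizontal ellipsis
    "\N{U+201C}" => '"',      # left double quotation mark
    "\N{U+201D}" => '"',      # right double quotation mark
    "\N{U+2013}" => '-',      # en dash
    "\N{U+2014}" => '-',      # em dash
);

# Replace each mapped non-ASCII character with its ASCII counterpart.
# Operates on decoded character strings, not on raw bytes.
sub flatten {
    my ($field) = @_;
    $field =~ s/([^\x00-\x7F])/exists $ascii_for{$1} ? $ascii_for{$1} : $1/ge;
    return $field;
}

print flatten("\N{U+201C}caf\N{U+00E9}\N{U+201D} \N{U+2013} open late\N{U+2026}"), "\n";
```

Because the table is keyed on characters rather than bytes, one entry per character is enough regardless of how the source file was encoded.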

      Thank you so much for your time and expertise!

        I was looking for a small set of extended ascii characters to "flatten" (if you will) to an ascii counterpart

        Sounds very much like Text::Unidecode!

        I do not know the encodings of the text files that I'm reading. They were probably exported as CSV from Excel or created by a Perl script from reading an Open Office .ods file. Tab delimited text files created differently over the course of 20+ years. That would make sense though since it's the older files that have a single byte eacute, then all of a sudden the two-byte eacute is the only variety found.

        Yes, that does sound likely. Here's a very simple example of how one might tell the difference between three of the encodings I named. Of course if you have more encodings than this, things can get more complex, and even if these encodings seem to work, you'll probably need to tweak the heuristics in the below example to fit your actual data.

        #!/usr/bin/env perl
        use warnings;
        use strict;
        use open qw/:std :utf8/;
        use Text::Unidecode;
        use Encode;

        print "# Text::Unidecode demo:\n";
        my $test = "\N{U+201C}test\N{U+201D} \N{U+2013} test\N{U+2026}";
        print " original: ", $test, "\n";
        print "asciified: ", unidecode($test), "\n";

        # set up some test data
        my $str = "\N{U+CF} spent 20\N{U+20AC} \N{U+C3}t the c\N{U+AA}f\N{U+E9}\n";
        {
            open my $fh1, '>:raw:encoding(CP-1252)', 'one.txt' or die $!;
            print $fh1 $str;
            close $fh1;
            open my $fh2, '>:raw:encoding(Latin-9)', 'two.txt' or die $!;
            print $fh2 $str;
            close $fh2;
            open my $fh3, '>:raw:encoding(UTF-8)', 'three.txt' or die $!;
            print $fh3 $str;
            close $fh3;
        }

        my $expected_chars   = qr/[\N{U+20AC}]/u;  # heuristic
        my $unexpected_chars = qr/[\N{U+80}]/u;    # heuristic

        for my $file (qw/ one.txt two.txt three.txt /) {
            # slurp the raw file as undecoded bytes
            open my $fh, '<:raw', $file or die "$file: $!";
            my $bytes = do { local $/; <$fh> };
            close $fh;
            my $string;
            # try different encodings
            for my $enc (qw/ UTF-8 Latin-9 CP-1252 /) {
                $string = eval { decode($enc, $bytes, Encode::FB_CROAK|Encode::LEAVE_SRC) };
                if ( defined $string && $string =~ $expected_chars
                        && $string !~ $unexpected_chars )
                    { print "### $file looks like $enc\n"; last }
                else { print "### $file is NOT $enc\n" }
            }
            die "Failed to decode $file" unless defined $string;
            print $string;
            print unidecode($string);
        }

        Output (on a terminal with UTF-8 encoding):

        # Text::Unidecode demo:
         original: “test” – test…
        asciified: "test" - test...
        ### one.txt is NOT UTF-8
        ### one.txt is NOT Latin-9
        ### one.txt looks like CP-1252
        Ï spent 20€ Ãt the cªfé
        I spent 20EUR At the cafe
        ### two.txt is NOT UTF-8
        ### two.txt looks like Latin-9
        Ï spent 20€ Ãt the cªfé
        I spent 20EUR At the cafe
        ### three.txt looks like UTF-8
        Ï spent 20€ Ãt the cªfé
        I spent 20EUR At the cafe
        
Re: Two octal values for eacute?
by Anonymous Monk on May 24, 2020 at 13:28 UTC

    Just for completeness' sake, in Unicode characters like e-acute can appear either composed (your \303\251) or decomposed (\145 \314 \201). The decomposed form is actually two characters, LATIN SMALL LETTER E (the \145) and COMBINING ACUTE ACCENT (the \314 \201). If you run into this, Unicode::Normalize is your friend -- or will be once you read the file '<:encoding(utf-8)'.
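To make the composed/decomposed distinction concrete, here is a minimal sketch using the core Unicode::Normalize module. It decodes the two byte sequences mentioned above and compares them before and after normalizing to NFC; the variable names are just for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFC NFD);

# Decode the two UTF-8 byte sequences into character strings.
my $composed   = decode('UTF-8', "\303\251");        # U+00E9
my $decomposed = decode('UTF-8', "\145\314\201");    # U+0065 U+0301

printf "composed: %d char(s), decomposed: %d char(s)\n",
    length($composed), length($decomposed);

# The raw strings differ, but their NFC forms are identical.
print "equal before NFC: ", ($composed eq $decomposed ? "yes" : "no"), "\n";
print "equal after NFC:  ", (NFC($composed) eq NFC($decomposed) ? "yes" : "no"), "\n";
```

Normalizing every decoded string to one form (NFC here) before matching means a pattern such as /\N{U+00E9}/ also catches the decomposed variant.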

      Belated thanks to all for the detailed explanations!