in thread UTF8 versus \w in pattern matching

I've been trying for a real SSCCE. Here's one more try: I fetch one of the source files with 'curl' directly to a file, open that file in Emacs, and whittle it down to a few letters, as in the following. I then get the output $VAR1 = "t\x{f3}n";. That does not look like UTF-8 to me.

#!/usr/bin/perl
use utf8;
use Data::Dumper;
use warnings;
use strict;
my $a = "tón";
print Dumper($a),qq(\n);

Is there a standard way to identify 8-bit, legacy text (which has been mislabeled upstream as UTF-8) and convert it into UTF-8 for continued work with regex?

Re^9: UTF8 versus \w in pattern matching
by haj (Vicar) on Jul 06, 2021 at 18:21 UTC

    It doesn't look like UTF-8 because it isn't supposed to look like UTF-8.

    It has nothing to do with legacy 8-bit.

    Data::Dumper shows Unicode codepoints and not encodings.
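
    A minimal sketch of that distinction, assuming the string is written with an escape so that the encoding of the source file plays no role. The variable names are made up for illustration. It prints the code points Perl holds and, separately, the bytes that the same string becomes once it is encoded as UTF-8:

```perl
use strict;
use warnings;
use Encode qw(encode);

# "t\x{f3}n" is the three-character string "tón", written with an escape
# so that this sketch does not depend on how the source file is saved.
my $text = "t\x{f3}n";

# Code points: what Perl stores, and what Data::Dumper reports.
my @codepoints = map { sprintf 'U+%04X', ord } split //, $text;

# Encoded form: the bytes that would go to a file or a terminal.
my $utf8_bytes = encode('UTF-8', $text);
my @bytes = map { sprintf '%02X', ord } split //, $utf8_bytes;

print "characters:  ", length($text), " (@codepoints)\n";
print "UTF-8 bytes: ", length($utf8_bytes), " (@bytes)\n";
```

    The string has three characters but four UTF-8 bytes, since U+00F3 encodes as C3 B3. Seeing \x{f3} in a dump therefore says nothing about how the text was or will be encoded.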

    If you open the file in Emacs, it will interpret the data using your preferred coding system, which is UTF-8 for current Emacsen. However, Emacs will fall back to ISO-8859-1 if the file doesn't contain valid UTF-8. Look at the Emacs mode line: if the first character is U, the buffer is UTF-8; if it is 1, it is ISO-8859-1.
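
    The same "try UTF-8, else fall back to ISO-8859-1" rule can be sketched in Perl with Encode. This is only a heuristic under stated assumptions: decode_with_fallback is a hypothetical helper name (not part of Encode), and the sketch assumes the only legacy encoding you will meet is ISO-8859-1. It cannot tell one 8-bit encoding from another, and Latin-1 bytes that happen to form valid UTF-8 will be taken as UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: returns the decoded text and the encoding that was used.
sub decode_with_fallback {
    my ($bytes) = @_;

    # Strict decode: FB_CROAK dies on malformed UTF-8,
    # LEAVE_SRC keeps $bytes untouched for the fallback.
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return ($text, 'UTF-8') if defined $text;

    # Every byte sequence is valid ISO-8859-1, so this decode cannot fail.
    return (decode('ISO-8859-1', $bytes), 'ISO-8859-1');
}

# Usage: the same word, once as Latin-1 bytes and once as UTF-8 bytes.
for my $bytes ("t\xf3n", "t\xc3\xb3n") {
    my ($text, $encoding) = decode_with_fallback($bytes);
    printf "%d bytes in, %d characters out, decoded as %s\n",
        length($bytes), length($text), $encoding;
}
```

    Both inputs end up as the same three-character string, which is the state you want before applying \w or any other regex.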

    You can enforce the encoding in Emacs with C-x RET f ISO-8859-1 RET. If you save and run the file in this encoding, Perl will croak because you said use utf8; and your source code isn't valid UTF-8.

    If you then omit use utf8; while keeping the ISO-8859-1 encoding and run the file, you'll get $VAR1 = 't�n'; because now it is your terminal which expects UTF-8 and gets a single 8-bit character instead.

    If you then add use Encode; and change the last line to print encode('UTF-8',Dumper($a)); (like you should when using a UTF-8 terminal), then you'll get $VAR1 = 'tón';

    I don't recommend Data::Dumper for such diagnostics because it might or might not use \x{} notation, as you just saw. It isn't easy, but it is rather straightforward if you keep track of the different places where encoding and decoding might occur.

      If you then add use Encode; and change the last line to print encode('UTF-8',Dumper($a)); (like you should when using a UTF-8 terminal), then you'll get $VAR1 = 'tón';

      Assuming the real code is going to use more than one print statement, this suggestion will require calling encode() for every print, which is not DRY programming. Alternative: call binmode STDOUT, ':encoding(UTF-8)'; once, before any print statements, and then use normal print statements (like print Dumper($a);) throughout. This lets the I/O layer handle the translation from Perl's internal representation to UTF-8-encoded output.
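
      A short sketch of the binmode approach, assuming a UTF-8 terminal. The in-memory filehandle at the end is an addition for illustration only: it uses the same layer so that the bytes produced by the layer can be inspected.

```perl
use strict;
use warnings;
use Data::Dumper;

my $text = "t\x{f3}n";   # "tón", written with an escape

# Set the encoding layer once, before any output ...
binmode STDOUT, ':encoding(UTF-8)';

# ... then print as usual; no encode() call per statement.
print Dumper($text);
print "$text\n";

# The same layer on an in-memory handle, to look at the bytes it writes.
open my $fh, '>:encoding(UTF-8)', \my $buffer or die "open: $!";
print {$fh} $text;
close $fh;

print "bytes written: ",
    join(' ', map { sprintf '%02X', ord } split //, $buffer), "\n";
```

      The layer belongs to the filehandle, so every later print on STDOUT is encoded the same way. The strings themselves stay as decoded text.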

        Too many moving parts!!! One should be using the following here:

        local $Data::Dumper::Useqq = 1;
        print(Dumper($a));

        Fix the problems until you get the correct string (one that contains "\x{e9}" or "\351" for "é"). Then worry about the output to the terminal.
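
        A sketch of that check, assuming two hypothetical test strings: one correctly decoded, and one that still holds raw UTF-8 bytes because it was never decoded. With Useqq set, the dump is plain ASCII, so the terminal's encoding cannot distort what you see.

```perl
use strict;
use warnings;
use Data::Dumper;

my $decoded   = "t\x{f3}n";      # the goal: one character for "ó"
my $undecoded = "t\xc3\xb3n";    # raw UTF-8 bytes that were never decoded

my ($good, $bad) = do {
    # Useqq makes Dumper escape everything outside printable ASCII.
    local $Data::Dumper::Useqq = 1;
    (Dumper($decoded), Dumper($undecoded));
};

print "decoded:   $good";
print "undecoded: $bad";
```

        The first dump should show a single escape for "ó" (\x{f3} or \363), the second should show two (\303\263 or the \x{} equivalents). Two escapes where you expect one character means the input still needs decoding.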

        Seeking work! You can reach me at ikegami@adaelis.com