Re^8: UTF8 versus \w in pattern matching

I've been trying to produce a real SSCCE. Here's one more attempt: when I fetch one of the source files with 'curl' directly to a file, open that file in Emacs, and whittle it down to a few letters as in the following, I get the output $VAR1 = "t\x{f3}n";. That does not look like UTF-8 to me.

#!/usr/bin/perl

use utf8;
use Data::Dumper;
use warnings;
use strict;

my $a = "tón";

print Dumper($a),qq(\n);

Is there a standard way to identify 8-bit, legacy text (which has been mislabeled upstream as UTF-8) and convert it into UTF-8 for continued work with regex?

Re^9: UTF8 versus \w in pattern matching by haj (Vicar) on Jul 06, 2021 at 18:21 UTC
It doesn't look like UTF-8 because it isn't supposed to look like UTF-8, and it has nothing to do with legacy 8-bit text. Data::Dumper shows Unicode code points, not encodings.

If you open the file in Emacs, it will use your preferred coding system to interpret the data; this is UTF-8 for current Emacsen. However, Emacs will fall back to ISO-8859-1 if the file doesn't contain valid UTF-8. Look at the Emacs modeline: if the first character is `U`, the buffer is UTF-8; if it is `1`, it is ISO-8859-1. You can enforce the encoding in Emacs with `C-x RET f ISO-8859-1 RET`.

If you execute the file in this encoding, Perl will croak because you said `use utf8;` and your source code isn't valid UTF-8. If you then omit `use utf8;` with ISO-8859-1 encoding and run the file, you'll get `$VAR1 = 't�n';` because now it is your terminal which expects UTF-8 and gets an 8-bit character. If you then add `use Encode;` and change the last line to `print encode('UTF-8',Dumper($a));` (as you should when using a UTF-8 terminal), you'll get `$VAR1 = 'tón';`.

I don't recommend Data::Dumper for such diagnostics because it might or might not use `\x{}` notation, as you just saw. It isn't easy, but it is rather straightforward if you keep track of the different places where encoding might occur.
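To make the difference between code points and encodings concrete, here is a minimal sketch. It is only an illustration, not part of the original test script: the string is written with a `\x{f3}` escape so that the result does not depend on how the source file itself is saved, and the variable names are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# "tón" as a character string; the escape keeps this source file pure ASCII.
my $text = "t\x{f3}n";

# The code points Perl holds for the string (this is what \x{f3} refers to).
my @codepoints = map { sprintf('U+%04X', ord($_)) } split //, $text;

# The bytes produced when the same string is encoded in two different ways.
my $utf8_hex   = unpack 'H*', encode('UTF-8',      $text);
my $latin1_hex = unpack 'H*', encode('ISO-8859-1', $text);

print "code points:   @codepoints\n";
print "UTF-8 bytes:   $utf8_hex\n";
print "Latin-1 bytes: $latin1_hex\n";
```

The string has three characters either way. Only the encoded forms differ: `c3 b3` for the "ó" in UTF-8 versus a single `f3` byte in ISO-8859-1.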
Re^10: UTF8 versus \w in pattern matching by pryrt (Abbot) on Jul 06, 2021 at 18:49 UTC
"If you then add `use Encode;` and change the last line to `print encode('UTF-8',Dumper($a));` (as you should when using a UTF-8 terminal), you'll get `$VAR1 = 'tón';`"

Assuming the real code is going to use more than one print statement, this suggestion requires calling `encode()` for every print, which is not DRY programming. Alternative: call the binmode function, as `binmode STDOUT, ':encoding(UTF-8)';`, sometime before any print statements, and then use normal print statements (like `print Dumper($a);`) throughout. This lets the I/O layer handle the translation from Perl's internal representation to UTF-8-encoded output.
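A minimal sketch of the binmode approach, assuming a terminal that expects UTF-8. The in-memory filehandle at the end is not needed in real code. It is only there for illustration, so that the bytes the `:encoding(UTF-8)` layer produces can be inspected.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Set the output layer once; every later print to STDOUT is encoded as UTF-8.
binmode STDOUT, ':encoding(UTF-8)';

my $word = "t\x{f3}n";   # "tón", written with an escape so the source stays ASCII
print "$word\n";         # no encode() call needed here

# The same layer on an in-memory handle (illustration only), to see the bytes.
my $buffer = '';
open my $fh, '>:encoding(UTF-8)', \$buffer
    or die "Cannot open in-memory handle: $!";
print {$fh} "$word\n";
close $fh or die "Cannot close in-memory handle: $!";

printf "bytes written: %s\n", unpack('H*', $buffer);
```

The last line reports `74c3b36e0a`: the layer turned the single character U+00F3 into the two bytes `c3 b3`, without any explicit `encode()` in the print statements.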
Re^11: UTF8 versus \w in pattern matching by ikegami (Patriarch) on Jul 06, 2021 at 21:01 UTC
Too many moving parts!!! One should be using the following here: `local $Data::Dumper::Useqq = 1; print(Dumper($a));` Fix the problems until you get the correct string (one that contains `"\x{e9}"` or `"\351"` for "é"). Then worry about the output to the terminal.
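For the original question about mislabeled input, one common heuristic (not something Perl does for you, and not foolproof) is to try a strict UTF-8 decode first and fall back to a legacy encoding when that fails. The sketch below assumes the legacy encoding is ISO-8859-1, which may not match the real data (Windows-1252 is another frequent candidate), and `decode_guess` is a hypothetical helper name invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);
use Data::Dumper;

# Hypothetical helper: strict UTF-8 first, ISO-8859-1 (an assumption) otherwise.
sub decode_guess {
    my ($bytes) = @_;
    my $copy = $bytes;   # decode() with a CHECK value may modify its argument
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return $text if defined $text;          # input was valid UTF-8
    return decode('ISO-8859-1', $bytes);    # fallback never fails: any byte is valid
}

local $Data::Dumper::Useqq = 1;   # show non-ASCII characters as escapes

print Dumper(decode_guess("t\xC3\xB3n"));   # valid UTF-8 bytes for "tón"
print Dumper(decode_guess("t\xF3n"));       # a lone 0xF3 is not valid UTF-8
```

Both calls should produce the same three-character string containing U+00F3, which is the "correct string" to aim for before worrying about terminal output. The heuristic can be fooled by legacy text that happens to form valid UTF-8 sequences, so fixing the label upstream is preferable when that is possible.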