UTF8 versus \w in pattern matching

mldvx4 has asked for the wisdom of the Perl Monks concerning the following question:

Re: UTF8 versus \w in pattern matching by Corion (Patriarch) on Jul 06, 2021 at 09:31 UTC
Is your source file encoded as UTF-8? Personally, I prefer to `use charnames;` and then to use `\N{...}` escapes in my source code for non-ASCII constants:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use charnames ':full';
binmode STDOUT, ':utf8';

my $a;
$a = "/i/\N{LATIN SMALL LETTER A WITH ACUTE}\N{LATIN SMALL LETTER E WITH ACUTE}\N{LATIN SMALL LETTER I WITH ACUTE}\N{LATIN SMALL LETTER O WITH ACUTE}\N{LATIN SMALL LETTER U WITH ACUTE}z/pl";
print qq(1: ),$a,qq(\n);
($a) = ($a =~ m/^([\/\p{Word}]+)/);
print qq(2: ),$a,qq(\n);
```

This prints the following for me:

```
1: /i/áéíóúz/pl
2: /i/áéíóúz/pl
```
Re^2: UTF8 versus \w in pattern matching by mldvx4 (Hermit) on Jul 06, 2021 at 10:37 UTC
Swapping out the value of `$a` with `"/i/\N{LATIN SMALL LETTER A WITH ACUTE}\N{LATIN SMALL LETTER E WITH ACUTE}\N{LATIN SMALL LETTER I WITH ACUTE}\N{LATIN SMALL LETTER O WITH ACUTE}\N{LATIN SMALL LETTER U WITH ACUTE}z/pl"` gives me the same output as before on my setups, both with `use utf8;` and without: the diacriticals are not matched by `\w` either way. If it matters, it is perl 5, version 32, subversion 1 (v5.32.1) built for x86_64-linux-gnu-thread-multi on Ubuntu 21.04 on the one system and perl 5, version 28, subversion 1 (v5.28.1) built for arm-linux-gnueabihf-thread-multi-64int on Raspbian GNU/Linux 10 (buster).
Re^3: UTF8 versus \w in pattern matching by hippo (Archbishop) on Jul 06, 2021 at 10:57 UTC
Time for some tests, then:

```perl
use strict;
use warnings;
use Test::More tests => 2;

my $str = "/i/\N{LATIN SMALL LETTER A WITH ACUTE}\N{LATIN SMALL LETTER E WITH ACUTE}\N{LATIN SMALL LETTER I WITH ACUTE}\N{LATIN SMALL LETTER O WITH ACUTE}\N{LATIN SMALL LETTER U WITH ACUTE}\N{LATIN SMALL LETTER Y WITH ACUTE}z/pl";
my $re = qr/^([\/\p{Word}]+)/;
like $str, $re, 'Matched';
$str =~ $re;
is $1, $str, 'Capture group 1';
```

Both pass here on v5.20.3 x86_64-linux-thread-multi. 🦛
Re^4: UTF8 versus \w in pattern matching by mldvx4 (Hermit) on Jul 06, 2021 at 11:48 UTC
Re^5: UTF8 versus \w in pattern matching by hippo (Archbishop) on Jul 06, 2021 at 12:20 UTC
Re^2: UTF8 versus \w in pattern matching by mldvx4 (Hermit) on Jul 06, 2021 at 10:26 UTC
"Is your source file encoded as UTF-8?"

Yes, I am reading many UTF-8 files. As part of an earlier project, I have ensured that the input really is UTF-8. However, on two different systems, I get the problem that `\w` does not match any non-ASCII letters.
Re: UTF8 versus \w in pattern matching by jo37 (Curate) on Jul 06, 2021 at 09:30 UTC
Unable to reproduce: with `use utf8;`, the second line of output has all the diacritical characters, just like the first line. Greetings, -jo
Re: UTF8 versus \w in pattern matching (basic test) by LanX (Saint) on Jul 06, 2021 at 11:02 UTC
Works for me. I'd say your file's encoding is not what you think it is.

```perl
use strict;
use warnings;
use Data::Dumper;
use utf8;

my $str = " 1 i á \x{3C3} _ ";   # \x{3C3} = small sigma
warn Dumper $str;
$str =~ s/\w+//g;                # delete all alpha-nums
warn Dumper $str;
warn "WORKS!" if $str =~ m/^ +$/;
```

```
C:/Strawberry/perl/bin\perl.exe -w d:/tmp/pm/utf8.pl
$VAR1 = " 1 i \x{e1} \x{3c3} _ ";
$VAR1 = ' ';
WORKS! at d:/tmp/pm/utf8.pl line 12.
```

Cheers Rolf (addicted to the Perl Programming Language :)

*) PM has problems displaying unicode characters like "σ" inside code tags.

Update: expanded code tests; switched to core Data::Dumper.
Re^2: UTF8 versus \w in pattern matching (basic test) by mldvx4 (Hermit) on Jul 06, 2021 at 11:45 UTC
Thanks. That snippet works, as-is, but still the text I am getting does not. The data is fetched over HTTP from WordPress. If I save the file and run the 'file' utility, I get the output "HTML document, UTF-8 Unicode text" for everything. Yet, when I process the file with perl, the `\w` pattern misses non-ASCII letters.
Re^3: UTF8 versus \w in pattern matching (basic test) by haj (Vicar) on Jul 06, 2021 at 12:25 UTC
How do you fetch and process the file?

Your original code example has no `use utf8;` and does not UTF-8-encode the output. You get your original string back only because two errors cancel out: your file is UTF-8-encoded, but you don't declare this to Perl. Perl reads the individual bytes of the UTF-8 encoding, which are not word characters and thus won't match `\w`. You then just `print` the bytes. If you are using a UTF-8 terminal, this "works" because the terminal decodes your bytes.

Perl's default encoding is not UTF-8. If you read the file and decode it from UTF-8, you should be fine. If you fetch with LWP, you can either print `$response->content` (without encoding it) or encode `$response->decoded_content` before printing.
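To make the bytes-versus-characters point concrete, here is a minimal sketch using the core Encode module. The variable `$bytes` is a made-up stand-in for whatever octets come back from a file read or from `$response->content`; it is not taken from the original code. The pattern deliberately uses `\w` rather than `\p{Word}`, since `\p{...}` would switch the regex to Unicode rules.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: $bytes is an illustrative stand-in for undecoded input.
# These are the UTF-8 octets for "/i/áéz/pl".
my $bytes = "/i/\xC3\xA1\xC3\xA9z/pl";

# Matching the raw octets: 0xC3, 0xA1, 0xA9 are not word characters
# under the default rules, so the capture stops after "/i/".
my ($raw_match) = $bytes =~ m{^([/\w]+)};

# Decode first, then match: now the string holds the characters
# U+00E1 and U+00E9, which \w accepts.
my $chars = decode('UTF-8', $bytes);
my ($decoded_match) = $chars =~ m{^([/\w]+)};

# Encode again on the way out so the terminal receives UTF-8 octets.
print encode('UTF-8', "raw:     $raw_match\n");
print encode('UTF-8', "decoded: $decoded_match\n");
```

With LWP, `$response->decoded_content` would play the role of `$chars` here, assuming the server declares its charset correctly.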
Re^3: UTF8 versus \w in pattern matching (basic test) by LanX (Saint) on Jul 06, 2021 at 12:56 UTC
Please use Data::Dumper for basic debugging, as demonstrated. Check your input, output and code. We can't do this for you ... Cheers Rolf (addicted to the Perl Programming Language :)
Re^4: UTF8 versus \w in pattern matching (basic test) by mldvx4 (Hermit) on Jul 06, 2021 at 13:03 UTC
Re^5: UTF8 versus \w in pattern matching (basic test) by LanX (Saint) on Jul 06, 2021 at 15:23 UTC
Re^5: UTF8 versus \w in pattern matching (basic test) by jo37 (Curate) on Jul 06, 2021 at 16:18 UTC
Re: UTF8 versus \w in pattern matching by ikegami (Patriarch) on Jul 06, 2021 at 20:36 UTC
Problem #1: You didn't tell Perl that the source file is encoded as UTF-8, which you do with `use utf8;`.

Problem #2: You didn't tell Perl how to encode the output for your terminal using something like `use open ':std', ':encoding(UTF-8)';`.

Finally, you mention `\w`. Because of a bug, `\w` doesn't always match characters in the U+7F..U+FF range. This bug is fixed with `use 5.014;`. That said, you actually used `\p{Word}`, which isn't affected by this bug.
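Put together, the three pragmas mentioned above might look like this in a small script. This is only a sketch: the string and the variable names `$path` and `$word_run` are invented for illustration, and the file itself is assumed to be saved as UTF-8.

```perl
use 5.014;                              # turns on unicode_strings, say, strict
use warnings;
use utf8;                               # the source file is UTF-8
use open ':std', ':encoding(UTF-8)';    # encode output for the terminal

# Assumption: an illustrative literal containing non-ASCII letters.
my $path = "/i/áéíóúz/pl";

# With the pragmas above, \w sees characters rather than bytes,
# so the whole string is captured.
my ($word_run) = $path =~ m{^([/\w]+)};

say "1: $path";
say "2: $word_run";
```

Without `use utf8;` the literal would be read as 17 bytes instead of 12 characters, and the capture would stop at the first non-ASCII byte.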
Re^2: UTF8 versus \w in pattern matching by mldvx4 (Hermit) on Jul 07, 2021 at 03:45 UTC
Thanks. That part about the U+7F..U+FF range explains things. I see that `m/^([\/\-\_\.\p{Word}\x7f-\xff]+)$/` matches, and `m/^([\/\-\_\.\p{Word}]+)$/` does not. I presume that is because the data upstream might really be ISO-8859-15 and not UTF-8? Should I try to convert the U+7F..U+FF range into UTF-8 before further processing? If so, how?
Re^3: UTF8 versus \w in pattern matching by ikegami (Patriarch) on Jul 11, 2021 at 05:12 UTC
Read again. The part about U+7F..U+FF applies to `\w`, but not to `\p{Word}`.

```
# Sometimes \w doesn't match U+E9.
$ perl -Mfeature=say -e'say "\xE9" =~ /^\w/ || 0'
0
# Sometimes it does.
$ perl -Mfeature=say -e'say "\xE9\x{2660}" =~ /^\w/ || 0'
1
# \w always matches U+E9 with "use 5.014;".
$ perl -Mfeature=say -e'use 5.014; say "\xE9" =~ /^\w/ || 0'
1
# \p{Word} always matches U+E9, period.
$ perl -Mfeature=say -e'say "\xE9" =~ /^\p{Word}/ || 0'
1
```

Your questions make absolutely no sense since you're not using `\w`. And I said the fix for that bug was to add `use 5.014;`, not to convert the input.
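The one-liners above can also be reproduced inside a single script, which makes it easier to see that the difference comes from how the string is stored internally and not from its content. This is a sketch, assuming Perl 5.14 or later; the `%result` hash and the variable names are made up for illustration. The `/u` modifier is not mentioned in the thread; it is shown as another way to request Unicode rules for a single pattern.

```perl
use strict;
use warnings;

my $native   = "\xE9";          # é stored as one byte, UTF8 flag off
my $upgraded = "\xE9";
utf8::upgrade($upgraded);       # same character, stored with the UTF8 flag on

my %result = (
    native   => ($native   =~ /^\w/  ? 1 : 0),   # default rules, byte string
    upgraded => ($upgraded =~ /^\w/  ? 1 : 0),   # default rules, upgraded string
    u_flag   => ($native   =~ /^\w/u ? 1 : 0),   # Unicode rules forced by /u
);

{
    # Lexically scoped: this is part of what "use 5.014;" turns on.
    use feature 'unicode_strings';
    $result{feature} = $native =~ /^\w/ ? 1 : 0;
}

printf "%-8s %d\n", $_, $result{$_} for sort keys %result;
```

The two strings compare equal with `eq`, yet only the `native` case fails to match, which is the inconsistency that `use 5.014;` removes.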
