Re^2: UTF8 versus \w in pattern matching

Replacing the value of $a with "/i/\N{LATIN SMALL LETTER A WITH ACUTE}\N{LATIN SMALL LETTER E WITH ACUTE}\N{LATIN SMALL LETTER I WITH ACUTE}\N{LATIN SMALL LETTER O WITH ACUTE}\N{LATIN SMALL LETTER U WITH ACUTE}z/pl" gives me the same output as before on both of my setups, with and without `use utf8;`: the diacriticals are not matched by `\w` either way. In case it matters, one system runs perl 5, version 32, subversion 1 (v5.32.1) built for x86_64-linux-gnu-thread-multi on Ubuntu 21.04, and the other runs perl 5, version 28, subversion 1 (v5.28.1) built for arm-linux-gnueabihf-thread-multi-64int on Raspbian GNU/Linux 10 (buster).


Re^3: UTF8 versus \w in pattern matching by hippo (Archbishop) on Jul 06, 2021 at 10:57 UTC
Time for some tests, then: `use strict; use warnings; use Test::More tests => 2; my $str = "/i/\N{LATIN SMALL LETTER A WITH ACUTE}\N{LATIN SMALL LETTER E WITH ACUTE}\N{LATIN SMALL LETTER I WITH ACUTE}\N{LATIN SMALL LETTER O WITH ACUTE}\N{LATIN SMALL LETTER U WITH ACUTE}\N{LATIN SMALL LETTER Y WITH ACUTE}z/pl"; my $re = qr/^([\/\p{Word}]+)/; like $str, $re, 'Matched'; $str =~ $re; is $1, $str, 'Capture group 1';` Both pass here on v5.20.3 x86_64-linux-thread-multi. 🦛
Re^4: UTF8 versus \w in pattern matching by mldvx4 (Hermit) on Jul 06, 2021 at 11:48 UTC
Thanks. Both pass here as well. `1..2 ok 1 - Matched ok 2 - Capture group 1` I've double-checked the source files, which are from WordPress via HTTP, and they get identified by the 'file' utility as "HTML document, UTF-8 Unicode text".
Re^5: UTF8 versus \w in pattern matching by hippo (Archbishop) on Jul 06, 2021 at 12:20 UTC
So there's nothing wrong with your version of perl: it correctly matches the accented characters with `\p{Word}`, and presumably also with `\w` if you change the value of $re thus: `my $re = qr/^([\/\w]+)/;` Are you definitely decoding the contents of these files when you read them in your perl script? It might also be worth checking the actual bytes in the data files with, e.g., hexdump. 🦛
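To illustrate the decoding question, here is a minimal sketch of what I'd try. It assumes the files really are UTF-8 (as 'file' reports) and uses a temporary file as a hypothetical stand-in for one of the downloads. The `slurp` helper is made up for this example. It reads the same bytes once through `:raw` and once through `:encoding(UTF-8)`, then applies the `\w` pattern and prints the code of the first two units after "/i/" in hex.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for a downloaded file: "/i/áéíóúz/pl" stored as UTF-8 bytes.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out, ':raw';
print {$out} "/i/\xC3\xA1\xC3\xA9\xC3\xAD\xC3\xB3\xC3\xBAz/pl";
close $out or die "close: $!";

# Read the whole file through the given PerlIO layer.
sub slurp {
    my ($layer) = @_;
    open my $in, "<$layer", $path or die "open: $!";
    local $/;                      # slurp mode
    my $data = <$in>;
    close $in;
    return $data;
}

my $re = qr/^([\/\w]+)/;

for my $layer (':raw', ':encoding(UTF-8)') {
    my $data  = slurp($layer);
    my ($got) = $data =~ $re;
    # Hex of the two units right after "/i/": bytes if undecoded, code points if decoded.
    my $hex = join ' ', map { sprintf '%02x', ord($_) } split //, substr($data, 3, 2);
    printf "%-17s length %2d, captured %2d, after /i/: %s\n",
        $layer, length $data, length $got, $hex;
}
```

On a stock perl without `use feature 'unicode_strings'`, I would expect the `:raw` read to capture only the three characters of "/i/" and show `c3 a1`, while the decoded read captures all twelve characters and shows `e1 e9`. If your script behaves like the first case, the data is probably reaching the regex as undecoded bytes.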
Re^6: UTF8 versus \w in pattern matching by mldvx4 (Hermit) on Jul 06, 2021 at 12:33 UTC
Re^7: UTF8 versus \w in pattern matching by hippo (Archbishop) on Jul 06, 2021 at 12:47 UTC