Re: UTF8 versus \w in pattern matching

Is your source file encoded as UTF-8?

Personally, I prefer to load charnames and then use \N{...} escapes in my source code for non-ASCII constants:

#!/usr/bin/perl

use strict;
use warnings;
use charnames ':full';

binmode STDOUT, ':utf8';

my $a;

$a = "/i/\N{LATIN SMALL LETTER A WITH ACUTE}\N{LATIN SMALL LETTER E WITH ACUTE}\N{LATIN SMALL LETTER I WITH ACUTE}\N{LATIN SMALL LETTER O WITH ACUTE}\N{LATIN SMALL LETTER U WITH ACUTE}z/pl";

print qq(1: ),$a,qq(\n);

($a) = ($a =~ m/^([\/\p{Word}]+)/);

print qq(2: ),$a,qq(\n);

This prints the following for me:

1: /i/áéíóúz/pl
2: /i/áéíóúz/pl


Re^2: UTF8 versus \w in pattern matching by mldvx4 (Hermit) on Jul 06, 2021 at 10:37 UTC
Swapping out the value of `$a` with `"/i/\N{LATIN SMALL LETTER A WITH ACUTE}\N{LATIN SMALL LETTER E WITH ACUTE}\N{LATIN SMALL LETTER I WITH ACUTE}\N{LATIN SMALL LETTER O WITH ACUTE}\N{LATIN SMALL LETTER U WITH ACUTE}z/pl"` gives me the same output as before on both of my setups, with and without `use utf8;`: the diacriticals are not matched by `\w` either way. If it matters, one system runs perl 5, version 32, subversion 1 (v5.32.1) built for x86_64-linux-gnu-thread-multi on Ubuntu 21.04, and the other runs perl 5, version 28, subversion 1 (v5.28.1) built for arm-linux-gnueabihf-thread-multi-64int on Raspbian GNU/Linux 10 (buster).
Re^3: UTF8 versus \w in pattern matching by hippo (Archbishop) on Jul 06, 2021 at 10:57 UTC
Time for some tests, then:

use strict;
use warnings;
use Test::More tests => 2;

my $str = "/i/\N{LATIN SMALL LETTER A WITH ACUTE}\N{LATIN SMALL LETTER E WITH ACUTE}\N{LATIN SMALL LETTER I WITH ACUTE}\N{LATIN SMALL LETTER O WITH ACUTE}\N{LATIN SMALL LETTER U WITH ACUTE}\N{LATIN SMALL LETTER Y WITH ACUTE}z/pl";
my $re = qr/^([\/\p{Word}]+)/;

like $str, $re, 'Matched';
$str =~ $re;
is $1, $str, 'Capture group 1';

Both pass here on v5.20.3 x86_64-linux-thread-multi. 🦛
Re^4: UTF8 versus \w in pattern matching by mldvx4 (Hermit) on Jul 06, 2021 at 11:48 UTC
Thanks. Both pass here as well.

1..2
ok 1 - Matched
ok 2 - Capture group 1

I've double-checked the source files, which come from WordPress via HTTP, and the 'file' utility identifies them as "HTML document, UTF-8 Unicode text".
Re^5: UTF8 versus \w in pattern matching by hippo (Archbishop) on Jul 06, 2021 at 12:20 UTC
Re^6: UTF8 versus \w in pattern matching by mldvx4 (Hermit) on Jul 06, 2021 at 12:33 UTC
Some notes below your chosen depth have not been shown here
Re^2: UTF8 versus \w in pattern matching by mldvx4 (Hermit) on Jul 06, 2021 at 10:26 UTC
"Is your source file encoded as UTF-8?" Yes, I am reading many UTF-8 files. As part of an earlier project, I have ensured that the input really is UTF-8. However, on two different systems, `\w` does not match any non-ASCII letters.
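
One possible explanation, offered as an assumption rather than a diagnosis of the scripts discussed here: input that is valid UTF-8 on disk still arrives in Perl as undecoded bytes unless the filehandle (or an explicit decode step) turns it into characters, and under the default regex rules `\w` does not match bytes above 0x7F in such a string. The sketch below uses a hypothetical sample string and a hypothetical helper, `leading_word_run`, and reads the same bytes through an in-memory filehandle with and without an `:encoding(UTF-8)` layer.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample: the UTF-8 bytes that a file or HTTP body would contain
# for "/i/" + a-acute + e-acute + "z/pl".
my $octets = encode('UTF-8', "/i/\x{E1}\x{E9}z/pl");

# Read one line from $octets through the given PerlIO layer ('' means raw
# bytes) and return the leading run of slashes and word characters.
sub leading_word_run {
    my ($layer) = @_;
    open my $fh, "<$layer", \$octets or die "open: $!";
    my $line = <$fh>;
    close $fh;
    my ($run) = $line =~ m/^([\/\w]+)/;
    return $run;
}

my $raw     = leading_word_run('');                  # undecoded bytes
my $decoded = leading_word_run(':encoding(UTF-8)');  # decoded characters

binmode STDOUT, ':encoding(UTF-8)';
print "raw:     $raw\n";
print "decoded: $decoded\n";
```

With the raw handle the match should stop at `/i/`, while the decoded handle should capture the whole string. If this is what is happening, opening the real input files with `'<:encoding(UTF-8)'` (or decoding the HTTP response body before matching) would be the corresponding change.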