in reply to UTF8 versus \w in pattern matching

Is your source file encoded as UTF-8?

Personally, I prefer to load charnames and then use \N{...} escapes in my source code for non-ASCII constants:

#!/usr/bin/perl
use strict;
use warnings;
use charnames ':full';
binmode STDOUT, ':utf8';

my $a;
$a = "/i/\N{LATIN SMALL LETTER A WITH ACUTE}\N{LATIN SMALL LETTER E WITH ACUTE}\N{LATIN SMALL LETTER I WITH ACUTE}\N{LATIN SMALL LETTER O WITH ACUTE}\N{LATIN SMALL LETTER U WITH ACUTE}z/pl";
print qq(1: ),$a,qq(\n);
($a) = ($a =~ m/^([\/\p{Word}]+)/);
print qq(2: ),$a,qq(\n);

This prints the following for me:

1: /i/áéíóúz/pl
2: /i/áéíóúz/pl

Re^2: UTF8 versus \w in pattern matching
by mldvx4 (Hermit) on Jul 06, 2021 at 10:37 UTC

    Swapping out the value of $a with "/i/\N{LATIN SMALL LETTER A WITH ACUTE}\N{LATIN SMALL LETTER E WITH ACUTE}\N{LATIN SMALL LETTER I WITH ACUTE}\N{LATIN SMALL LETTER O WITH ACUTE}\N{LATIN SMALL LETTER U WITH ACUTE}z/pl" gives me the same output as before on my setups, both with use utf8; and without: the diacriticals are not matched by \w either way. If it matters, one system runs perl 5, version 32, subversion 1 (v5.32.1) built for x86_64-linux-gnu-thread-multi on Ubuntu 21.04, and the other runs perl 5, version 28, subversion 1 (v5.28.1) built for arm-linux-gnueabihf-thread-multi-64int on Raspbian GNU/Linux 10 (buster).

      Time for some tests, then:

      use strict;
      use warnings;
      use Test::More tests => 2;

      my $str = "/i/\N{LATIN SMALL LETTER A WITH ACUTE}\N{LATIN SMALL LETTER E WITH ACUTE}\N{LATIN SMALL LETTER I WITH ACUTE}\N{LATIN SMALL LETTER O WITH ACUTE}\N{LATIN SMALL LETTER U WITH ACUTE}\N{LATIN SMALL LETTER Y WITH ACUTE}z/pl";
      my $re = qr/^([\/\p{Word}]+)/;
      like $str, $re, 'Matched';
      $str =~ $re;
      is $1, $str, 'Capture group 1';

      Both pass here on v5.20.3 x86_64-linux-thread-multi.


      🦛

        Thanks. Both pass here as well.

        1..2
        ok 1 - Matched
        ok 2 - Capture group 1

        I've double-checked the source files, which are from WordPress via HTTP, and they get identified by the 'file' utility as "HTML document, UTF-8 Unicode text".
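
Since the tests pass with a string literal but the real data still fails, one possible explanation (an assumption, not something confirmed in this thread) is that the downloaded content reaches the regex as undecoded UTF-8 bytes rather than as characters. The sketch below simulates that in memory with Encode; the variable names are made up for illustration, and $bytes merely stands in for whatever the HTTP fetch returns.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Character string with five accented vowels (U+E1, U+E9, U+ED, U+F3, U+FA).
my $chars = "/i/\N{U+E1}\N{U+E9}\N{U+ED}\N{U+F3}\N{U+FA}z/pl";

# Hypothetical stand-in for fetched data: the same text as raw UTF-8 octets.
my $bytes = encode('UTF-8', $chars);

my $re = qr/^([\/\w]+)/;

# On the byte string, \w only sees ASCII word characters, so the match
# stops at the first octet of the first accented letter.
my ($from_bytes) = $bytes =~ $re;

# After decoding, the accented letters are single characters that \w matches.
my ($from_chars) = decode('UTF-8', $bytes) =~ $re;

binmode STDOUT, ':encoding(UTF-8)';
print "bytes:   $from_bytes\n";
print "decoded: $from_chars\n";
```

If this reproduces the symptom, decoding the content once, right where it enters the program, would be the place to look.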

Re^2: UTF8 versus \w in pattern matching
by mldvx4 (Hermit) on Jul 06, 2021 at 10:26 UTC
    Is your source file encoded as UTF-8?

    Yes, I am reading many UTF-8 files. As part of an earlier project, I have ensured that the input really is UTF-8. However, on two different systems, I get the problem that \w does not match any non-ASCII letters.
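
A file being valid UTF-8 on disk does not by itself mean Perl treats what it reads as characters; that depends on the I/O layer on the filehandle. As a minimal sketch, assuming the files are opened without an :encoding layer (not stated in the thread), the following writes a small temporary file and reads it back both ways. The helper first_word_run and the sample file are hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample file standing in for one of the UTF-8 input files.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out, ':encoding(UTF-8)';
print {$out} "/i/\N{U+E1}\N{U+E9}\N{U+ED}\N{U+F3}\N{U+FA}z/pl\n";
close $out or die "close: $!";

# Read the first line through the given layer and return the leading run
# of slashes and word characters.
sub first_word_run {
    my ($layer) = @_;
    open my $in, "<$layer", $path or die "open: $!";
    my $line = <$in>;
    close $in;
    my ($match) = $line =~ m/^([\/\w]+)/;
    return $match;
}

my $raw     = first_word_run(':raw');              # bytes, no decoding
my $decoded = first_word_run(':encoding(UTF-8)');  # decoded characters

binmode STDOUT, ':encoding(UTF-8)';
print "raw:     $raw\n";
print "decoded: $decoded\n";
```

With the :raw layer the match ends at the first non-ASCII byte, while with :encoding(UTF-8) it runs through the accented letters.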