Time for some tests, then:

use strict;
use warnings;
use Test::More tests => 2;

# Path-like string containing accented Latin letters.
my $str = "/i/"
        . "\N{LATIN SMALL LETTER A WITH ACUTE}"
        . "\N{LATIN SMALL LETTER E WITH ACUTE}"
        . "\N{LATIN SMALL LETTER I WITH ACUTE}"
        . "\N{LATIN SMALL LETTER O WITH ACUTE}"
        . "\N{LATIN SMALL LETTER U WITH ACUTE}"
        . "\N{LATIN SMALL LETTER Y WITH ACUTE}"
        . "z/pl";
my $re = qr/^([\/\p{Word}]+)/;

like $str, $re, 'Matched';
$str =~ $re;
is $1, $str, 'Capture group 1';

Both pass here on v5.20.3 x86_64-linux-thread-multi.


🦛

Re^4: UTF8 versus \w in pattern matching
by mldvx4 (Hermit) on Jul 06, 2021 at 11:48 UTC

    Thanks. Both pass here as well.

    1..2
    ok 1 - Matched
    ok 2 - Capture group 1

    I've double-checked the source files, which are from WordPress via HTTP, and they get identified by the 'file' utility as "HTML document, UTF-8 Unicode text".

      So there's nothing wrong with your version of perl: it correctly matches the UTF-8 accented characters with \p{Word}, and presumably also with \w if you change the value of $re thus:

      my $re = qr/^([\/\w]+)/;

      Are you definitely decoding the contents of these files when you read them in your perl script?
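
      To illustrate why decoding matters, here is a minimal sketch. It assumes the data reaches the script as raw UTF-8 octets; the helper matched_prefix and the sample string are made up for illustration. It uses Encode::decode, and opening the file with the '<:encoding(UTF-8)' layer should have the same effect at read time.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Same pattern as suggested in the thread.
my $re = qr/^([\/\w]+)/;

# Hypothetical helper: return the part of $text captured by $re.
sub matched_prefix {
    my ($text) = @_;
    my ($prefix) = $text =~ $re;
    return $prefix;
}

# "/i/" + a-acute + "z" as raw UTF-8 octets, as an undecoded read delivers it.
my $octets = "/i/\xC3\xA1z";

# The same data decoded into a Perl character string.
my $chars = decode('UTF-8', $octets);

printf "undecoded: matched %d of %d\n",
    length(matched_prefix($octets)), length($octets);
printf "decoded:   matched %d of %d\n",
    length(matched_prefix($chars)), length($chars);
```

      Without decoding, \w sees two separate high-bit octets rather than one letter, so the match should stop at the first accented character.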

      Might also be worth checking the actual data in the data files with eg. hexdump.
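
      If hexdump is not to hand, the same check can be sketched in perl itself. The helper name hex_bytes is hypothetical; it only prints each octet of a byte string as two hex digits.

```perl
use strict;
use warnings;

# Hypothetical helper: render each octet of a byte string as two hex digits.
sub hex_bytes {
    my ($octets) = @_;
    return join ' ', unpack '(H2)*', $octets;
}

# a-acute encoded as UTF-8 is two octets; as ISO-8859-1/15 it is one.
print hex_bytes("\xC3\xA1"), "\n";
print hex_bytes("\xE1"), "\n";
```

      An accented letter showing up as a c3 xx pair in the file points to UTF-8; a single octet such as e1 points to a Latin-1 style encoding.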


      🦛

        Using my $re = qr/^([\/\w]+)/; as the pattern gives the same problems. I am quite sure that the input files are UTF-8. However, checking in different terminals, the script's output renders properly if I switch the terminal from UTF-8 to ISO-8859-15, even with \N{LATIN SMALL LETTER A WITH ACUTE} for the letters. So this may be a terminal problem, except I really wonder why the script, which uses \N{LATIN SMALL LETTER A WITH ACUTE}, is still outputting ISO-8859-15 instead of UTF-8.
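
        One possible explanation, offered as a sketch since the script's output handling is not shown here: \N{...} only puts the character into the string, while the bytes actually written are decided by the layers on the output handle. With no encoding layer, perl writes characters below 256 as single Latin-1 octets, which an ISO-8859-15 terminal would render correctly. The example below prints to in-memory handles (purely an illustration device) so the emitted octets can be compared.

```perl
use strict;
use warnings;

my $str = "\N{LATIN SMALL LETTER A WITH ACUTE}";

# Print to in-memory handles so the emitted octets can be inspected.
my ($plain, $layered) = ('', '');

# No encoding layer: the character is written as one Latin-1 octet.
open my $out_plain, '>', \$plain or die "open failed: $!";
print {$out_plain} $str;
close $out_plain;

# With a UTF-8 layer: the character is encoded on output.
open my $out_utf8, '>:encoding(UTF-8)', \$layered or die "open failed: $!";
print {$out_utf8} $str;
close $out_utf8;

printf "no layer:    %s\n", join ' ', unpack '(H2)*', $plain;
printf "UTF-8 layer: %s\n", join ' ', unpack '(H2)*', $layered;
```

        If that is what is happening, calling binmode(STDOUT, ':encoding(UTF-8)') before printing is the usual remedy.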