Re: UTF8 versus \w in pattern matching
by Corion (Patriarch) on Jul 06, 2021 at 09:31 UTC
Is your source file encoded as UTF-8?
Personally, I prefer to load charnames and then write non-ASCII constants in my source code as \N{...} escapes:
#!/usr/bin/perl
use strict;
use warnings;
use charnames ':full';
binmode STDOUT, ':utf8';
my $a;
$a = "/i/\N{LATIN SMALL LETTER A WITH ACUTE}\N{LATIN SMALL LETTER E WITH ACUTE}\N{LATIN SMALL LETTER I WITH ACUTE}\N{LATIN SMALL LETTER O WITH ACUTE}\N{LATIN SMALL LETTER U WITH ACUTE}z/pl";
print qq(1: ),$a,qq(\n);
($a) = ($a =~ m/^([\/\p{Word}]+)/);
print qq(2: ),$a,qq(\n);
This prints the following for me:
1: /i/áéíóúz/pl
2: /i/áéíóúz/pl
Swapping out the value of $a with "/i/\N{LATIN SMALL LETTER A WITH ACUTE}\N{LATIN SMALL LETTER E WITH ACUTE}\N{LATIN SMALL LETTER I WITH ACUTE}\N{LATIN SMALL LETTER O WITH ACUTE}\N{LATIN SMALL LETTER U WITH ACUTE}z/pl" gives me the same output as before on my setups, both with use utf8; and without: the diacriticals are not matched by \w either way. If it matters, it is perl 5, version 32, subversion 1 (v5.32.1) built for x86_64-linux-gnu-thread-multi on Ubuntu 21.04 on the one system and perl 5, version 28, subversion 1 (v5.28.1) built for arm-linux-gnueabihf-thread-multi-64int on Raspbian GNU/Linux 10 (buster).
use strict;
use warnings;
use Test::More tests => 2;
my $str = "/i/\N{LATIN SMALL LETTER A WITH ACUTE}\N{LATIN SMALL LETTER E WITH ACUTE}\N{LATIN SMALL LETTER I WITH ACUTE}\N{LATIN SMALL LETTER O WITH ACUTE}\N{LATIN SMALL LETTER U WITH ACUTE}\N{LATIN SMALL LETTER Y WITH ACUTE}z/pl";
my $re = qr/^([\/\p{Word}]+)/;
like $str, $re, 'Matched';
$str =~ $re;
is $1, $str, 'Capture group 1';
Both pass here on v5.20.3 x86_64-linux-thread-multi.
Is your source file encoded as UTF-8?
Yes, I am reading many UTF-8 files. As part of an earlier project, I have ensured that the input really is UTF-8. However, on two different systems, I get the problem that \w does not match any non-ASCII letters.
Re: UTF8 versus \w in pattern matching
by jo37 (Curate) on Jul 06, 2021 at 09:30 UTC
Unable to reproduce: With use utf8; the second line of output has all the diacritical characters - as well as the first line.
Greetings, -jo
$gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$
Re: UTF8 versus \w in pattern matching (basic test)
by LanX (Saint) on Jul 06, 2021 at 11:02 UTC
use strict;
use warnings;
use Data::Dumper;
use utf8;
my $str = " 1 i á \x{3C3} _ "; # \x{3C3} = small sigma
warn Dumper $str;
$str =~ s/\w+//g; # delete all alpha-nums
warn Dumper $str;
warn "WORKS!" if $str =~ m/^ +$/;
C:/Strawberry/perl/bin\perl.exe -w d:/tmp/pm/utf8.pl
$VAR1 = " 1 i \x{e1} \x{3c3} _ ";
$VAR1 = ' ';
WORKS! at d:/tmp/pm/utf8.pl line 12.
Note: PM has problems displaying Unicode characters like "σ" inside code tags, hence the \x{3C3} escape in the string above.
Thanks. That snippet works, as-is, but still the text I am getting does not. The data is fetched over HTTP from WordPress. If I save the file and run the 'file' utility, I get the output "HTML document, UTF-8 Unicode text" for everything. Yet, when I process the file with perl, the \w pattern misses non-ASCII letters.
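One possible explanation, offered as an assumption since the fetching code is not shown in the thread: `file` reports how the bytes on disk are encoded, but Perl treats data it reads as raw bytes until it is decoded. A minimal sketch of the difference follows; the sample string and the helper name `count_word_chars` are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for a body fetched over HTTP: the UTF-8 *bytes* of "café".
my $bytes = "caf\xC3\xA9";

# Hypothetical helper: length of the leading run of \w characters.
sub count_word_chars {
    my ($str) = @_;
    my ($run) = $str =~ /^(\w+)/;
    return length($run // '');
}

# Decoding turns the two bytes C3 A9 into the single character U+00E9.
my $chars = decode('UTF-8', $bytes);

printf "undecoded: %d\n", count_word_chars($bytes);   # 3, stops before the raw bytes
printf "decoded:   %d\n", count_word_chars($chars);   # 4, U+00E9 counts as a word character
```

If the data comes from a file rather than a variable, opening it with an `:encoding(UTF-8)` layer should have the same effect as the explicit `decode` call.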
Re: UTF8 versus \w in pattern matching
by ikegami (Patriarch) on Jul 06, 2021 at 20:36 UTC
Problem #1: You didn't tell Perl that the source file is encoded as UTF-8, which is what use utf8; does.
Problem #2: You didn't tell Perl how to encode the output for your terminal, using something like use open ':std', ':encoding(UTF-8)';.
Finally, you mention \w. Because of a bug, \w doesn't always match characters in the U+7F..U+FF range: whether it matches depends on how Perl happens to store the string internally. Adding use 5.014; fixes this, because it enables the unicode_strings feature. That said, you actually used \p{Word}, which isn't affected by this bug.
Seeking work! You can reach me at ikegami@adaelis.com
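Put together, the three points above might look like the following sketch. It assumes the script itself is saved as UTF-8, and the helper name `word_prefix` is invented for illustration.

```perl
#!/usr/bin/perl
use 5.014;                             # enables unicode_strings, so \w behaves consistently
use warnings;
use utf8;                              # this source file is saved as UTF-8
use open ':std', ':encoding(UTF-8)';   # UTF-8 layers on STDIN, STDOUT and STDERR

# Hypothetical helper: return the leading run of slashes and word characters.
sub word_prefix {
    my ($str) = @_;
    my ($prefix) = $str =~ m{^([/\w]+)};
    return $prefix;
}

my $path = "/i/áéíóúz/pl";
say "1: $path";
say "2: ", word_prefix($path);
```

Both lines should show the full string, accented letters included.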
Thanks. That part about the U+7F..U+FF range explains things. I see that m/^([\/\-\_\.\p{Word}\x7f-\xff]+)$/ matches, and m/^([\/\-\_\.\p{Word}]+)$/ does not. I presume that is because the data upstream might really be ISO-8859-15 and not UTF-8? Should I try to convert the U+7F..U+FF range into UTF-8 before further processing? If so, how?
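If the upstream data really is a mix of encodings, one rough way to handle it is to try a strict UTF-8 decode first and fall back to ISO-8859-15 only when that fails. This is a sketch under that assumption, not something the thread confirms; `decode_guess` is a made-up name, and the heuristic can misfire because some ISO-8859-15 byte sequences also happen to be valid UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode as strict UTF-8, else fall back to ISO-8859-15.
sub decode_guess {
    my ($bytes) = @_;

    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC keeps it from modifying $bytes while checking.
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return $text if defined $text;

    # Not valid UTF-8, so treat the bytes as ISO-8859-15.
    return decode('ISO-8859-15', $bytes);
}

my $from_utf8   = decode_guess("caf\xC3\xA9");   # valid UTF-8 bytes
my $from_latin9 = decode_guess("caf\xE9");       # a lone E9 byte is not valid UTF-8

print "same text\n" if $from_utf8 eq $from_latin9;
```

Either way the result is a character string, so \w and \p{Word} see the letter U+00E9 rather than its encoded bytes.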
# Sometimes \w doesn't match U+E9.
$ perl -Mfeature=say -e'say "\xE9" =~ /^\w/ || 0'
0
# Sometimes it does.
$ perl -Mfeature=say -e'say "\xE9\x{2660}" =~ /^\w/ || 0'
1
# \w always matches U+E9 with "use 5.014;".
$ perl -Mfeature=say -e'use 5.014; say "\xE9" =~ /^\w/ || 0'
1
# \p{Word} always matches U+E9, period.
$ perl -Mfeature=say -e'say "\xE9" =~ /^\p{Word}/ || 0'
1
Your questions make absolutely no sense since you're not using \w. And I said the fix for that bug was to add use 5.014;, not to convert the input.