Re: UTF8 versus \w in pattern matching (basic test)

Works for me.

I'd say your file's encoding is not what you think it is.

use strict;
use warnings;
use Data::Dumper;
use utf8;

my $str = " 1 i á \x{3C3} _ ";          # \x{3C3} = small sigma 
warn Dumper $str;

$str =~ s/\w+//g;                       # delete all alpha-nums
warn Dumper $str;

warn "WORKS!" if $str =~ m/^ +$/;

C:/Strawberry/perl/bin\perl.exe -w d:/tmp/pm/utf8.pl
$VAR1 = " 1 i \x{e1} \x{3c3} _ ";
$VAR1 = ' ';
WORKS! at d:/tmp/pm/utf8.pl line 12.

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery

*) PM has problems displaying Unicode characters like "σ" inside code tags

update

expanded code tests
switched to core Data::Dumper

Re^2: UTF8 versus \w in pattern matching (basic test) by mldvx4 (Hermit) on Jul 06, 2021 at 11:45 UTC
Thanks. That snippet works as-is, but the text I am actually getting still does not. The data is fetched over HTTP from WordPress. If I save the file and run the 'file' utility, I get the output "HTML document, UTF-8 Unicode text" for everything. Yet when I process the file with perl, the `\w` pattern misses non-ASCII letters.
Re^3: UTF8 versus \w in pattern matching (basic test) by haj (Vicar) on Jul 06, 2021 at 12:25 UTC
How do you fetch and process the file? Your original code example has no `use utf8;` and does not UTF-8-encode its output. You get your original string back only because two errors cancel out: your file is UTF-8-encoded, but you don't declare this to Perl. Perl therefore reads the individual bytes of the UTF-8 encoding, which are not word characters and thus won't match `\w`. You then just `print` those bytes. On a UTF-8 terminal this "works" because the terminal decodes the bytes for you. Perl's default encoding is not UTF-8. If you read the file and decode it from UTF-8, you should be fine. If you fetch with LWP, you can either print `$response->content` (without encoding it) or encode `$response->decoded_content` before printing.
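To make the point about bytes versus characters concrete, here is a minimal sketch using the core Encode module. The sample bytes are made up for illustration (they are the UTF-8 encoding of "á σ"), and the variable names are not from the original code.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample data: the raw UTF-8 bytes for "á σ", as they would
# arrive from a file or socket that was read without any decoding.
my $bytes = "\xC3\xA1 \xCF\x83";

# Undecoded: \w is applied to single bytes above 0x7F, none of which
# counts as a word character in a plain byte string, so nothing is removed.
(my $raw = $bytes) =~ s/\w+//g;

# Decoded: the string now holds the characters "á", " " and "σ",
# and \w matches both letters.
my $chars = decode('UTF-8', $bytes);
(my $stripped = $chars) =~ s/\w+//g;

printf "undecoded: %d bytes left\n",   length $raw;        # 5
printf "decoded:   %d char(s) left\n", length $stripped;   # 1

# Encode explicitly on the way out instead of relying on the terminal.
binmode STDOUT, ':raw';
print encode('UTF-8', $chars), "\n";
```

When reading from a file, opening it with the `:encoding(UTF-8)` layer should have the same effect as the explicit `decode` call here.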
Re^3: UTF8 versus \w in pattern matching (basic test) by LanX (Saint) on Jul 06, 2021 at 12:56 UTC
Please use Data::Dumper for basic debugging, like demonstrated. Check your input, output and code. We can't do this for you ... Cheers Rolf
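As a rough illustration of that kind of check, the following sketch dumps the same text once as undecoded bytes and once as a decoded string. The sample input is hypothetical, and `$Data::Dumper::Useqq` and `$Data::Dumper::Terse` are only set here to keep the output short and escaped.

```perl
use strict;
use warnings;
use Data::Dumper;
use Encode qw(decode);

$Data::Dumper::Useqq = 1;   # escape non-ASCII and control characters
$Data::Dumper::Terse = 1;   # omit the "$VAR1 = " prefix

# Hypothetical input: the UTF-8 bytes for "á", as read without decoding.
my $bytes = "\xC3\xA1";
my $chars = decode('UTF-8', $bytes);

my $dump_bytes = Dumper($bytes);   # shows two separate byte escapes
my $dump_chars = Dumper($chars);   # shows a single \x{e1} escape

print "undecoded: $dump_bytes";
print "decoded:   $dump_chars";
```

If the dump of your real input looks like the first form, the data has not been decoded yet and `\w` will not see letters in it.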
Re^4: UTF8 versus \w in pattern matching (basic test) by mldvx4 (Hermit) on Jul 06, 2021 at 13:03 UTC
Using `Data::Dumper` in the following,

use utf8;
use Data::Dumper;
use strict;
use warnings;

my $a;
$a = "/i/\N{LATIN SMALL LETTER A WITH ACUTE}\N{LATIN SMALL LETTER E WITH ACUTE}\N{LATIN SMALL LETTER I WITH ACUTE}\N{LATIN SMALL LETTER O WITH ACUTE}\N{LATIN SMALL LETTER U WITH ACUTE}z/pl";
print Dumper($a);

I get this output:

$VAR1 = "/i/\x{e1}\x{e9}\x{ed}\x{f3}\x{fa}z/pl";
Re^5: UTF8 versus \w in pattern matching (basic test) by LanX (Saint) on Jul 06, 2021 at 15:23 UTC
Re^5: UTF8 versus \w in pattern matching (basic test) by jo37 (Curate) on Jul 06, 2021 at 16:18 UTC
Re^6: UTF8 versus \w in pattern matching (basic test) by haj (Vicar) on Jul 06, 2021 at 17:54 UTC
Re^6: UTF8 versus \w in pattern matching (basic test) by ikegami (Patriarch) on Jul 06, 2021 at 21:07 UTC