in reply to UTF8 versus \w in pattern matching

Works for me.

I'd say your file's encoding is not what you think it is.

use strict;
use warnings;
use Data::Dumper;
use utf8;

my $str = " 1 i á \x{3C3} _ ";   # \x{3C3} = small sigma

warn Dumper $str;
$str =~ s/\w+//g;                # delete all alpha-nums
warn Dumper $str;

warn "WORKS!" if $str =~ m/^ +$/;

C:/Strawberry/perl/bin\perl.exe -w d:/tmp/pm/utf8.pl
$VAR1 = " 1 i \x{e1} \x{3c3} _ ";
$VAR1 = ' ';
WORKS! at d:/tmp/pm/utf8.pl line 12.

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery

*) PM has problems displaying unicode characters like "σ" inside code tags


Re^2: UTF8 versus \w in pattern matching (basic test)
by mldvx4 (Hermit) on Jul 06, 2021 at 11:45 UTC

    Thanks. That snippet works, as-is, but still the text I am getting does not. The data is fetched over HTTP from WordPress. If I save the file and run the 'file' utility, I get the output "HTML document, UTF-8 Unicode text" for everything. Yet, when I process the file with perl, the \w pattern misses non-ASCII letters.

      How do you fetch and process the file? Your original code example has no use utf8; and does not UTF-8-encode the output. You get your original string only because of a cancellation of errors:
      • Your file is UTF-8-encoded, but you don't declare this to Perl. Perl therefore reads the individual bytes of the UTF-8 encoding, which are not word characters and thus won't match \w.
      • You just print the bytes. If you are using a UTF-8 terminal, this "works" because the terminal decodes your bytes.
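      To see the two points above in isolation, here is a minimal sketch using only the core Encode module. The sample bytes are my own choice (the UTF-8 encoding of "á"), not taken from your data:

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 bytes for "á" (U+00E1), as they would arrive from an
# undecoded file or socket. Hypothetical sample data.
my $bytes = "\xC3\xA1";

# Undecoded: two separate bytes, neither of which is a word character
# under Perl's default (non-Unicode) rules for byte strings.
my $bytes_match = ($bytes =~ /\w/) ? 1 : 0;

# Decoded: one character, which \w does match.
my $chars       = decode('UTF-8', $bytes);
my $chars_match = ($chars =~ /\w/) ? 1 : 0;

printf "bytes: length %d, matches \\w: %d\n", length $bytes, $bytes_match;
printf "chars: length %d, matches \\w: %d\n", length $chars, $chars_match;
```

      The first line should report a length of 2 and no match, the second a length of 1 and a match. Note that the sketch deliberately does not enable the unicode_strings feature, which would change how the undecoded bytes are treated.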

      Perl's default encoding is not UTF-8. If you read the file and decode it from UTF-8 you should be fine. If you fetch with LWP, you can either print $response->content (without encoding it) or encode $response->decoded_content before printing.
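      As a rough sketch of "decode on input, encode on output" with I/O layers: the in-memory filehandle and the sample text below are stand-ins I made up so that the example runs on its own. A real script would open the saved file by name instead, and with LWP the decoding step would be $response->decoded_content.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the saved page: the UTF-8 bytes of "niño σ".
# A real script would pass a file name to open() instead of \$raw.
my $raw = "ni\xC3\xB1o \xCF\x83\n";

# Decode on the way in: the layer turns UTF-8 bytes into characters.
open my $in, '<:encoding(UTF-8)', \$raw or die "Cannot open: $!";
my $line = <$in>;
close $in;
chomp $line;

# Encode on the way out: the layer turns characters back into UTF-8 bytes.
binmode STDOUT, ':encoding(UTF-8)';

# \w now sees characters, so the non-ASCII letters are removed as well.
(my $stripped = $line) =~ s/\w+//g;

print "before: [$line]\n";
print "after:  [$stripped]\n";
```

      With the layers in place, the substitution removes both words and leaves only the space between them.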

      Please use Data::Dumper for basic debugging, as demonstrated.

      Check your input, output and code.

      We can't do this for you ...

      Cheers Rolf

        Using Data::Dumper in the following,
        use utf8;
        use Data::Dumper;
        use strict;
        use warnings;

        my $a;
        $a = "/i/\N{LATIN SMALL LETTER A WITH ACUTE}\N{LATIN SMALL LETTER E WITH ACUTE}\N{LATIN SMALL LETTER I WITH ACUTE}\N{LATIN SMALL LETTER O WITH ACUTE}\N{LATIN SMALL LETTER U WITH ACUTE}z/pl";
        print Dumper($a);
        I get this output:
        $VAR1 = "/i/\x{e1}\x{e9}\x{ed}\x{f3}\x{fa}z/pl";