in reply to Re: UTF8 versus \w in pattern matching (basic test)
in thread UTF8 versus \w in pattern matching

Thanks. That snippet works as-is, but the text I am actually getting still does not. The data is fetched over HTTP from WordPress. If I save it to a file and run the 'file' utility, it reports "HTML document, UTF-8 Unicode text" every time. Yet when I process the file with perl, the \w pattern misses non-ASCII letters.

Re^3: UTF8 versus \w in pattern matching (basic test)
by haj (Vicar) on Jul 06, 2021 at 12:25 UTC
    How do you fetch and process the file? Your original code example has no use utf8; and does not UTF-8-encode the output. You get your original string back only because two errors cancel each other out:
    • Your file is UTF-8-encoded, but you don't declare this to Perl. Perl therefore reads the individual bytes of the UTF-8 encoding, which are not word characters and thus won't match \w.
    • You just print the bytes. If you are using a UTF-8 terminal, this "works" because the terminal decodes your bytes.

    Perl's default encoding is not UTF-8. If you read the file and decode it from UTF-8 you should be fine. If you fetch with LWP, you can either print $response->content (without encoding it) or encode $response->decoded_content before printing.
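To make the point about decoding concrete, here is a minimal sketch. The byte string is a made-up stand-in for data read from a file or socket without a decoding layer, and the variable names are illustrative only. It assumes Perl's default regex rules (no locale, no unicode_strings feature).

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample: the UTF-8 bytes for "áé", as they would arrive
# from a file or socket that was read without decoding.
my $bytes = "\xC3\xA1\xC3\xA9";

# On the raw bytes, \w uses byte semantics and matches ASCII only.
my @byte_matches = $bytes =~ /\w/g;

# After decoding, the string holds two characters, and both match \w.
my $text = decode('UTF-8', $bytes);
my @char_matches = $text =~ /\w/g;

printf "raw bytes: %d word characters; decoded: %d word characters\n",
    scalar @byte_matches, scalar @char_matches;

# Encode again on the way out so the terminal receives valid UTF-8.
print encode('UTF-8', $text), "\n";
```

With LWP, $response->decoded_content would take the place of the explicit decode call here, as described above.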

Re^3: UTF8 versus \w in pattern matching (basic test)
by LanX (Saint) on Jul 06, 2021 at 12:56 UTC
    Please use Data::Dumper for basic debugging, as demonstrated.

    Check your input, output and code.

    We can't do this for you ...

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      Using Data::Dumper in the following,
      use utf8;
      use Data::Dumper;
      use strict;
      use warnings;

      my $a;
      $a = "/i/\N{LATIN SMALL LETTER A WITH ACUTE}\N{LATIN SMALL LETTER E WITH ACUTE}\N{LATIN SMALL LETTER I WITH ACUTE}\N{LATIN SMALL LETTER O WITH ACUTE}\N{LATIN SMALL LETTER U WITH ACUTE}z/pl";
      print Dumper($a);
      I get this output:
      $VAR1 = "/i/\x{e1}\x{e9}\x{ed}\x{f3}\x{fa}z/pl";
        So that looks correct ...

        ...but as I said, check your input and output too.

        If you've not "fetched" the web data correctly, it will show in the dump.
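As a rough illustration of what "showing in the dump" can look like, the sketch below dumps the same text twice, once as undecoded UTF-8 bytes and once decoded. The sample data is invented, and setting $Data::Dumper::Useqq is a choice made here so that non-ASCII data is escaped rather than printed raw; it is not something the posts above prescribe.

```perl
use strict;
use warnings;
use Data::Dumper;
use Encode qw(decode);

$Data::Dumper::Useqq = 1;   # escape non-ASCII data instead of printing it raw

# Hypothetical stand-in for fetched content: the UTF-8 bytes for "áz".
my $raw     = "\xC3\xA1z";
my $decoded = decode('UTF-8', $raw);

print Dumper($raw);       # three bytes: the accented letter takes two
print Dumper($decoded);   # two characters: the accented letter is one

printf "raw length %d, decoded length %d\n", length $raw, length $decoded;
```

If a dump of the fetched data shows two escapes per accented letter, the data is most likely still undecoded bytes.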

        That's basic debugging!

        Cheers Rolf

        The Dumper output shows an encoding in ISO 8859-1, not UTF-8. That's strange.

        Greetings,
        -jo
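One possible reading of that dump: escapes such as \x{e1} are character code points rather than bytes of any particular encoding, and for U+0080 through U+00FF the code point happens to equal the ISO 8859-1 byte value. The sketch below, with illustrative variable names, compares the code point with the bytes produced by encoding the same character both ways.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{e1}";   # one character, code point U+00E1 (a with acute)

my $utf8   = encode('UTF-8',      $char);   # two bytes
my $latin1 = encode('ISO-8859-1', $char);   # one byte

printf "code point: U+%04X\n", ord $char;
printf "UTF-8 bytes: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $utf8;
printf "ISO 8859-1 bytes: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $latin1;
```

The UTF-8 form is the two bytes C3 A1, while the ISO 8859-1 form is the single byte E1, which matches the code point number.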
