Re^2: UTF8 versus \w in pattern matching (basic test)

Re^3: UTF8 versus \w in pattern matching (basic test) by haj (Vicar) on Jul 06, 2021 at 12:25 UTC
How do you fetch and process the file? Your original code example has no `use utf8;` and does not UTF-8-encode the output. You get your original string back only because two errors cancel out: your file is UTF-8-encoded, but you don't declare this to Perl. Perl therefore reads the individual bytes of the UTF-8 encoding, which are not word characters and thus won't match `\w`. You then just `print` those bytes. On a UTF-8 terminal this "works" because the terminal decodes the bytes for you. Perl's default encoding is not UTF-8. If you read the file and decode it from UTF-8, you should be fine. If you fetch with LWP, you can either print `$response->content` (without encoding it) or encode `$response->decoded_content` before printing.
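To make the difference between bytes and decoded characters concrete, here is a minimal sketch. It assumes the input arrives as UTF-8 bytes and that the regex runs under Perl's default rules (no `use feature 'unicode_strings'`, no `/u`); the sample string and the helper name `count_word_chars` are made up for illustration and are not from the original code.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: count how many characters in a string match \w.
sub count_word_chars {
    my ($string) = @_;
    my $count = () = $string =~ /\w/g;
    return $count;
}

# The UTF-8 bytes for "a with acute" and "e with acute", as they would
# arrive from a file or HTTP response that has not been decoded.
my $bytes = "\xC3\xA1\xC3\xA9";

# Decoding turns the four bytes into two characters.
my $chars = decode('UTF-8', $bytes);

printf "undecoded: length %d, word characters %d\n",
    length($bytes), count_word_chars($bytes);
printf "decoded:   length %d, word characters %d\n",
    length($chars), count_word_chars($chars);

# Encode again before printing so the terminal receives UTF-8 bytes.
print encode('UTF-8', $chars), "\n";
```

The same decode-on-input, encode-on-output pattern applies whether the data comes from a file or from LWP.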
Re^3: UTF8 versus \w in pattern matching (basic test) by LanX (Saint) on Jul 06, 2021 at 12:56 UTC
Please use Data::Dumper for basic debugging, as demonstrated. Check your input, output and code. We can't do this for you ... Cheers Rolf (addicted to the Perl Programming Language :) Wikisyntax for the Monastery)
Re^4: UTF8 versus \w in pattern matching (basic test) by mldvx4 (Hermit) on Jul 06, 2021 at 13:03 UTC
Using `Data::Dumper` in the following, `use utf8; use Data::Dumper; use strict; use warnings; my $a; $a = "/i/\N{LATIN SMALL LETTER A WITH ACUTE}\N{LATIN SMALL LETTER E WITH ACUTE}\N{LATIN SMALL LETTER I WITH ACUTE}\N{LATIN SMALL LETTER O WITH ACUTE}\N{LATIN SMALL LETTER U WITH ACUTE}z/pl"; print Dumper($a);` I get this output: `$VAR1 = "/i/\x{e1}\x{e9}\x{ed}\x{f3}\x{fa}z/pl";`
Re^5: UTF8 versus \w in pattern matching (basic test) by LanX (Saint) on Jul 06, 2021 at 15:23 UTC
So that looks correct ... but as I said, check your input and output too. If you've not "fetched" the web data correctly, it will show in the dump. That's basic debugging! Cheers Rolf (addicted to the Perl Programming Language :) Wikisyntax for the Monastery)
Re^5: UTF8 versus \w in pattern matching (basic test) by jo37 (Curate) on Jul 06, 2021 at 16:18 UTC
The `Dumper` output shows an encoding in ISO 8859-1, not UTF-8. That's strange. Greetings, -jo `$gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$`
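A note on reading such a dump: the `\x{e1}` notation names a character by its code point, and for characters below U+0100 that number happens to equal the ISO 8859-1 byte value, which can make a dump of a character string look like Latin-1. A minimal sketch for checking this is below; the helper name `hex_bytes` is hypothetical, and the single-character string stands in for the dumped value.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

# One character, given by code point: U+00E1 (a with acute).
my $char = "\x{e1}";

printf "code point:          U+%04X\n", ord $char;
printf "as ISO-8859-1 bytes: %s\n", hex_bytes(encode('ISO-8859-1', $char));
printf "as UTF-8 bytes:      %s\n", hex_bytes(encode('UTF-8', $char));
```

The byte sequence only comes into existence when the string is encoded for output, so the dump by itself says nothing about which encoding will be used.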
Re^6: UTF8 versus \w in pattern matching (basic test) by haj (Vicar) on Jul 06, 2021 at 17:54 UTC
Re^7: UTF8 versus \w in pattern matching (basic test) by jo37 (Curate) on Jul 06, 2021 at 18:03 UTC
Re^6: UTF8 versus \w in pattern matching (basic test) by ikegami (Patriarch) on Jul 06, 2021 at 21:07 UTC