Unexpected interaction between decode_entities() and lc()

kurisuto has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Unexpected interaction between decode_entities() and lc() by haukex (Archbishop) on Nov 14, 2017 at 13:55 UTC
It took a little editing but I was able to reproduce this (Linux / Perl 5.26). Note that AFAIK PerlMonks defaults to the Latin1 character set encoding, so as opposed to the file you get from the `[download]` link, I am assuming that your source code file is correctly encoded in valid UTF-8 and contains the correctly encoded characters "É" and "é". In my post below I am using the `<pre>` instead of `<code>` tags (and a few other replacements) in order to get PerlMonks to display the post correctly. For posting on PerlMonks, personally I would not `use utf8;` and would use `\x{00C9}` escapes instead, although then again, if I do that with your code, none of the input strings are flagged as UTF-8. (Update: With `\N{U+00C9}`, they do get flagged as UTF8.)

It appears that HTML::Entities's `decode_entities` does not enable the UTF8 flag on the string, so that the non-UTF8 string `"&Eacute;dition limit&eacute;e."` does not get upgraded.

    #!/usr/bin/env perl
    use warnings;
    use strict;
    use utf8;
    use open qw/:std :utf8/;
    use HTML::Entities;
    use Devel::Peek;

    my @strs = ("Édition limitÉe.", "Édition limitÉe.", "&Eacute;dition limit&Eacute;e.");
    for my $str (@strs) {
        $str = decode_entities($str);
        Dump($str);
        $str = lc($str);
        Dump($str);
        print "{$str}\n";
    }
    __END__
    # Output edited for brevity
    SV = ...
      FLAGS = (POK,pPOK,UTF8)
      PV = 0xf667c0 "\303\211dition limit\303\211e."\0 [UTF8 "\x{c9}dition limit\x{c9}e."]
    SV = ...
      FLAGS = (POK,pPOK,UTF8)
      PV = 0xf667c0 "\303\251dition limit\303\251e."\0 [UTF8 "\x{e9}dition limit\x{e9}e."]
    {édition limitée.}
    SV = ...
      FLAGS = (POK,pPOK,UTF8)
      PV = 0xffde20 "\303\211dition limit\303\211e."\0 [UTF8 "\x{c9}dition limit\x{c9}e."]
    SV = ...
      FLAGS = (POK,pPOK,UTF8)
      PV = 0xffde20 "\303\251dition limit\303\251e."\0 [UTF8 "\x{e9}dition limit\x{e9}e."]
    {édition limitée.}
    SV = ...
      FLAGS = (POK,pPOK)
      PV = 0xfe3440 "\311dition limit\311e."\0
    SV = ...
      FLAGS = (POK,pPOK)
      PV = 0xfe3440 "\311dition limit\311e."\0
    {Édition limitÉe.}

The "correct" way to solve this depends a bit on where the strings you are getting are coming from. Are they all embedded in your source code? Are you reading them from a file? If so, are you opening the files with the correct layers, that is, for example, `open my $fh, '<:encoding(UTF-8)', ...`?

Update: In the code, changed all `é` to `É` and `&eacute;` to `&Eacute;` to make the effects more clear. (Also, I can't reproduce the dependence on the `-w` switch that 1nickt reported.)
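To make the question about layers concrete, here is a minimal sketch of reading through a decoding layer before calling `decode_entities` and `lc`. It is an illustration under assumptions, not code from the thread: the helper name `lc_lines` is hypothetical, and an in-memory filehandle stands in for a real input file so that the example is self-contained.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Hypothetical helper (not from the thread): read lines through an
# :encoding(UTF-8) layer, then decode entities and lowercase each line.
sub lc_lines {
    my ($bytes) = @_;
    # An in-memory filehandle stands in for a real file here.
    open my $fh, '<:encoding(UTF-8)', \$bytes or die "open: $!";
    my @out;
    while (my $line = <$fh>) {
        chomp $line;
        push @out, lc(decode_entities($line));
    }
    close $fh;
    return @out;
}

# Two input lines as raw UTF-8 octets: one with literal accented
# characters, one written entirely with HTML entities.
my @lines = lc_lines(
    "\xC3\x89dition limit\xC3\xA9e.\n&Eacute;dition limit&eacute;e.\n"
);

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @lines;
```

Because both lines arrive as decoded character strings, the entity-only line is handled the same way as the line with literal accents.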
Re^2: Unexpected interaction between decode_entities() and lc() by kurisuto (Novice) on Nov 14, 2017 at 15:23 UTC
My real intent is to read from a file, and that's where I first noticed the problem. I made the script above to test and illustrate the problem. You said you're assuming that my script source file is UTF-8 encoded, and yes, that's the case. Following is a tiny script which reads from a file and which has the same problem:

    #!/usr/bin/perl5.16.3
    use strict;
    use HTML::Entities;
    binmode STDIN, ':encoding(UTF-8)';
    binmode STDOUT, ':encoding(UTF-8)';
    while(<>) {
        utf8::decode($_);
        chomp;
        $_ = decode_entities($_);
        $_ = lc($_);
        print $_, "\n";
    }

Here are the contents of my test input file:

    Édition limitée.
    &Eacute;dition limit&eacute;e.

`&Eacute;` still ends up as É, not é as intended. I think I'm doing what I'm supposed to do in terms of enabling the UTF-8 flag on $_, but please let me know if I've missed something.
Re^3: Unexpected interaction between decode_entities() and lc() by haukex (Archbishop) on Nov 14, 2017 at 15:50 UTC
First, I would strongly recommend against using any of the functions from the utf8 module, since these are really only meant for use if one knows the inner workings of how Perl handles Unicode and non-Unicode strings. Second, it seems that calling `binmode` on `STDIN` is not enough. Instead, try the open pragma with the `:std` option (update: since that also sets the default open modes and works on files passed on the command line as well). The following works for me, both for a UTF-8 file piped into `perl` and a file listed on the command line:

    use warnings;
    use strict;
    use open qw/:std :utf8/;
    use HTML::Entities;
    while(<>) {
        chomp;
        print lc(decode_entities($_)), "\n";
    }

Update: Added a missing "not", oops :-)
Re^4: Unexpected interaction between decode_entities() and lc() by ikegami (Patriarch) on Nov 16, 2017 at 19:00 UTC
Re^5: Unexpected interaction between decode_entities() and lc() by haukex (Archbishop) on Nov 16, 2017 at 19:46 UTC
Re^3: Unexpected interaction between decode_entities() and lc() by hippo (Archbishop) on Nov 14, 2017 at 15:44 UTC
The encoding layer is already handling that, so forget about the `utf8::decode($_);` line and it all just works:

    $ cat uct.pl
    #!/usr/bin/perl5.16.3
    use strict;
    use HTML::Entities;
    binmode STDIN, ':encoding(UTF-8)';
    binmode STDOUT, ':encoding(UTF-8)';
    while(<>) {
        chomp;
        $_ = decode_entities($_);
        $_ = lc($_);
        print $_, "\n";
    }
    $ echo -e "Édition limitée.\n&Eacute;dition limit&eacute;e." | perl uct.pl
    édition limitée.
    édition limitée.

Update: forgot to mention: this is on perl 5.20.3 regardless of your #! line.
Re^4: Unexpected interaction between decode_entities() and lc() by haukex (Archbishop) on Nov 14, 2017 at 15:53 UTC
Re^5: Unexpected interaction between decode_entities() and lc() by choroba (Cardinal) on Nov 14, 2017 at 16:51 UTC
Re: Unexpected interaction between decode_entities() and lc() by 1nickt (Canon) on Nov 14, 2017 at 13:35 UTC
Update: removing the `-w` flag from your shebang line fixes the problem. I cannot say what issue it triggers, but in general you should not use `-w` in your program, as it turns on all warnings in all code pulled into it. To limit warnings to your own code, use `use warnings;` instead. Hope this helps!

Original reply below:

Hi, I can't reproduce your results on my system.

    $ perl 1203368.pl
    édition limitée.
    édition limitée.

    This is perl 5, version 26, subversion 1 (v5.26.1) built for x86_64-linux

The way forward always starts with a minimal test.
Re^2: Unexpected interaction between decode_entities() and lc() by kurisuto (Novice) on Nov 14, 2017 at 13:44 UTC
I tried removing -w just now, and got the same output: édition limitée. Édition limitée. I'm not sure why -w would affect the behavior of decode_entities() or lc() (other than maybe printing warnings). I wasn't getting any warnings one way or the other; just unexpected output.
Re^3: Unexpected interaction between decode_entities() and lc() by 1nickt (Canon) on Nov 14, 2017 at 14:16 UTC
I'm not sure either. But strange things like that do happen. More info: What's wrong with -w and $^W from the Perl documentation. The way forward always starts with a minimal test.
Re: Unexpected interaction between decode_entities() and lc() by vr (Curate) on Nov 14, 2017 at 16:48 UTC
I think there's a missing bit in the previous answers, i.e. that `lc` only follows the documentation: "If `use bytes` is in effect: The results follow ASCII rules. Only the characters A-Z change, to a-z respectively." (Emphasis mine.) Since neither `"use feature 'unicode_strings';"`, nor e.g. `"use 5.016;"` was declared, then `lc` does exactly as described above. BTW, I'm impressed with the clever behavior of `decode_entities`, i.e. its output depending on the utf8 flag of its argument.
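A minimal sketch of the effect of `unicode_strings` on the entity-only case discussed in this thread. The sub names `lc_default` and `lc_unicode` are hypothetical, chosen only for this illustration, and the feature requires Perl 5.12 or later.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Pure-ASCII input: after decoding, the string holds U+00C9, but
# (as the Devel::Peek output earlier in the thread shows) it is not
# flagged as UTF-8 internally.
my $decoded = decode_entities("&Eacute;dition limit&Eacute;e.");

# Hypothetical helper: lc without the feature. On such a string
# only A-Z are lowercased.
sub lc_default {
    my ($s) = @_;
    return lc($s);
}

# Hypothetical helper: lc with the feature enabled in this block.
# Unicode rules apply regardless of the internal flag.
sub lc_unicode {
    use feature 'unicode_strings';
    my ($s) = @_;
    return lc($s);
}

binmode STDOUT, ':encoding(UTF-8)';
print "default: ", lc_default($decoded), "\n";
print "unicode: ", lc_unicode($decoded), "\n";
```

The feature is lexically scoped, so it can be enabled only around the code that needs character semantics.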
Re^2: Unexpected interaction between decode_entities() and lc() by haukex (Archbishop) on Nov 15, 2017 at 09:31 UTC
`use feature 'unicode_strings';` is an excellent point!

"Since neither `"use feature 'unicode_strings';"`, nor e.g. `"use 5.016;"` was declared, then lc does exactly as described above."

If by "described above" you mean "Only the characters A-Z change, to a-z respectively.", then I think your reading of the lc docs might be a little off; my understanding is that bytes is not the default behavior. The following test took a little fiddling to get the right values, but it passes on all Perl releases starting with 5.8.1, 5.8.9, 5.10.1, up to 5.26 and shows the differences:

    use warnings;
    use strict;
    use utf8;
    use Test::More;
    diag explain "Perl $]";
    plan tests => $] ge '5.012' ? 15 : 11;
    SKIP: {
        is "\N{U+00C9}", "É", '\N{U+...} escape';
        skip 'Perl ge 5.12 required', 1 unless $] ge '5.012';
        ok utf8::is_utf8("\N{U+00C9}"), '\N{U+...} sets UTF8';
    }
    {
        ok !utf8::is_utf8("\x{C9}"), '\x doesn\'t set UTF8';
        is lc("\x{C9}"), "\xC9", 'lc on non-UTF8 str';
        ok utf8::is_utf8("É"), 'str is UTF8';
        is lc("É"), "é", 'lc on UTF8 str';
    }
    {
        use bytes;
        ok !utf8::is_utf8("\x{C9}"), 'bytes: \x doesn\'t set UTF8';
        is lc("\x{C9}"), "\xC9", 'bytes: lc on non-UTF8 str';
        ok utf8::is_utf8("É"), 'bytes: str is UTF8';
        is lc("É"), $] lt '5.008009' ? "\xC9" : "\xC3\x89", 'bytes: lc on UTF8 str';
    }
    SKIP: {
        skip 'Perl ge 5.12 required', 1 unless $] ge '5.012';
        ok eval q{ do {
            use feature 'unicode_strings';
            ok !utf8::is_utf8("\x{C9}"), 'u_s: \x doesn\'t set UTF8';
            is lc("\x{C9}"), "é", 'u_s: lc on non-UTF8 str';
            ok utf8::is_utf8("É"), 'u_s: str is UTF8';
            is lc("É"), "é", 'u_s: lc on UTF8 str';
        1 } }, 'unicode_strings works' or warn $@
    }
Re^3: Unexpected interaction between decode_entities() and lc() by vr (Curate) on Nov 15, 2017 at 11:00 UTC
You are right, I was too quick to paste a quote from the lc documentation page; it should have been the last, fall-through case. Also, kurisuto, my comment was not a solution, rather an attempt at an explanation (to myself) of what was happening; I too rarely deal with strings that are extended-ASCII and yet not utf8. The proper fix (at least, for anything but one-time scripts) would be to always explicitly decode inputs from all sources, rather than hoping they are Latin-1 only and that Perl silently does "the right thing" in the background.
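One way to read "always explicitly decode inputs" is sketched below, under assumptions: the helper name `normalize_line` is hypothetical, the input is assumed to be UTF-8 octets, and `unicode_strings` (Perl 5.12 or later) is enabled so that `lc` does not depend on how the string happens to be stored internally.

```perl
use strict;
use warnings;
use feature 'unicode_strings';
use Encode qw(decode encode);
use HTML::Entities qw(decode_entities);

# Hypothetical helper: UTF-8 octets in, lowercased UTF-8 octets out.
sub normalize_line {
    my ($octets) = @_;
    # Decode explicitly; FB_CROAK makes malformed input die instead of
    # being silently repaired.
    my $text = decode('UTF-8', $octets, Encode::FB_CROAK);
    $text = decode_entities($text);
    # Encode explicitly on the way out.
    return encode('UTF-8', lc($text));
}

print normalize_line("\xC3\x89dition limit\xC3\xA9e."), "\n";
print normalize_line("&Eacute;dition limit&eacute;e."), "\n";
```

Decoding at the input boundary and encoding at the output boundary keeps everything in between working on characters rather than bytes.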
Re^2: Unexpected interaction between decode_entities() and lc() by kurisuto (Novice) on Nov 14, 2017 at 20:07 UTC
Aha! Adding "use feature 'unicode_strings';" fixed the problem. Thank you!