Re^2: Unexpected interaction between decode_entities() and lc()

My real intent is to read from a file, and that's where I first noticed the problem. I made the script above to test and illustrate the problem. You said you're assuming that my script source file is UTF-8 encoded, and yes, that's the case.

Following is a tiny script which reads from a file and which has the same problem:

#!/usr/bin/perl5.16.3

use strict;
use HTML::Entities;

binmode STDIN, ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';

while(<>) {
    utf8::decode($_);
    chomp;

    $_ = decode_entities($_);
    $_ = lc($_);

    print $_, "\n";
}

Here are the contents of my test input file:

Édition limitée.
&Eacute;dition limitée.

É still ends up as É, not é as intended. I think I'm doing what I'm supposed to do in terms of enabling the UTF-8 flag on $_, but please let me know if I've missed something.


Re^3: Unexpected interaction between decode_entities() and lc() by haukex (Archbishop) on Nov 14, 2017 at 15:50 UTC
First, I would strongly recommend against using any of the functions from the utf8 module, since these are really only meant for use if one knows the inner workings of how Perl handles Unicode and non-Unicode strings. Second, it seems calling `binmode` on `STDIN` is not enough. Instead, try the open pragma with the `:std` option, since that also sets the default open modes and works on files passed on the command line as well. The following works for me, both for a UTF-8 file piped into `perl` and a file listed on the command line:

use warnings;
use strict;
use open qw/:std :utf8/;
use HTML::Entities;

while(<>) {
    chomp;
    print lc(decode_entities($_)), "\n";
}

Update: Added a missing "not", oops :-)
Re^4: Unexpected interaction between decode_entities() and lc() by ikegami (Patriarch) on Nov 16, 2017 at 19:00 UTC
> First, I would strongly recommend against using any of the functions from the utf8 module, since these are really only meant for use if one knows the inner workings of how Perl handles Unicode and non-Unicode strings.

That's not true.

`utf8::encode` and `utf8::decode`: Safe. Used for efficient encoding and decoding.
`utf8::upgrade` and `utf8::downgrade`: Safe. Used for working around The Unicode Bug.
`utf8::is_utf8`: AVOID. Only useful when checking code for The Unicode Bug. Any use outside of debugging suffers from The Unicode Bug by definition. Use one of the previously-named functions instead.
`utf8::valid`: Safe, but of extremely limited use (as it checks if scalars are well-formed).

It's actually Encode that has the subs you should avoid.

`Encode::encode` and `Encode::decode`: Safe. Used for encoding and decoding.
`Encode::is_utf8`: AVOID. Only useful when checking code for The Unicode Bug. Any use outside of debugging suffers from The Unicode Bug by definition.
`Encode::_utf8_on` and `Encode::_utf8_off`: UNSAFE. There is no reason to ever use these, so NEVER use them. Use one of the utf8:: functions instead. `Encode::_utf8_on($s)` is short for `utf8::decode($s) if !utf8::is_utf8($s);`, which suffers from The Unicode Bug by definition. `Encode::_utf8_off($s)` is short for `utf8::encode($s) if utf8::is_utf8($s);`, which suffers from The Unicode Bug by definition.

The Unicode Bug refers to code whose behaviour depends on the internal storage format of a string (i.e. the value returned by `utf8::is_utf8` and `Encode::is_utf8`).
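To make the definition of The Unicode Bug above concrete, here is a minimal sketch of how `lc` can behave differently for two strings that compare equal. It assumes a script without `use feature 'unicode_strings'` and without `use locale`; the variable names are made up for illustration.

```perl
use strict;
use warnings;

# "\xC9" is É (U+00C9). Both variables hold the same one-character string.
my $downgraded = "\xC9";
my $upgraded   = "\xC9";
utf8::downgrade($downgraded);   # internal storage: one byte per character
utf8::upgrade($upgraded);       # internal storage: UTF-8

# The strings are equal as far as Perl-level comparison is concerned.
my $same = $downgraded eq $upgraded;

# Assumption: no 'unicode_strings' feature is in effect, so lc() uses
# Unicode rules only for the upgraded string.
my $lc_down = lc $downgraded;
my $lc_up   = lc $upgraded;

printf "equal: %s\n", $same ? "yes" : "no";
printf "lc(downgraded) = U+%04X\n", ord $lc_down;
printf "lc(upgraded)   = U+%04X\n", ord $lc_up;
```

Under those assumptions this should report the strings as equal, yet print U+00C9 for the downgraded copy and U+00E9 for the upgraded one. Calling `utf8::upgrade` on a string before `lc` is one way to apply the workaround mentioned above.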
Re^5: Unexpected interaction between decode_entities() and lc() by haukex (Archbishop) on Nov 16, 2017 at 19:46 UTC
Thanks for clarifying! I knew about `utf8::is_utf8` and `utf8::valid` and I guess I extrapolated too much.
Re^3: Unexpected interaction between decode_entities() and lc() by hippo (Archbishop) on Nov 14, 2017 at 15:44 UTC
The encoding layer is already handling that, so forget about the `utf8::decode($_);` line and it all just works:

$ cat uct.pl
#!/usr/bin/perl5.16.3

use strict;
use HTML::Entities;

binmode STDIN, ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';

while(<>) {
    chomp;

    $_ = decode_entities($_);
    $_ = lc($_);

    print $_, "\n";
}
$ echo -e "Édition limitée.\nÉdition limitée." | perl uct.pl
édition limitée.
édition limitée.

Update: forgot to mention: this is on perl 5.20.3 regardless of your #! line.
Re^4: Unexpected interaction between decode_entities() and lc() by haukex (Archbishop) on Nov 14, 2017 at 15:53 UTC
Unfortunately that only works if piping stuff into Perl, but it does not work if files are specified on the command line, since those are opened and are not affected by `binmode STDIN` (see my post here).
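As a rough illustration of the point about files versus `STDIN`, the sketch below opens a file with an explicit `:encoding(UTF-8)` layer, so decoding does not depend on `binmode STDIN` at all. The temporary file and its contents are invented here only to keep the example self-contained, and `HTML::Entities` is left out so that it runs with core modules alone.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical input file, created only so the sketch can run on its own.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh;                                     # write raw bytes
print {$tmp_fh} "\xC3\x89dition limit\xC3\xA9e.\n";  # "Édition limitée." as UTF-8 bytes
close $tmp_fh or die "close: $!";

# Open with an explicit decoding layer on this handle.
open my $in, '<:encoding(UTF-8)', $path or die "Cannot open $path: $!";
my @lines;
while (my $line = <$in>) {
    chomp $line;
    push @lines, lc $line;   # the line is already decoded text here
}
close $in;

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @lines;
```

This covers only files the script opens itself. For files named on the command line and read through `<>`, the `use open qw/:std :utf8/;` approach from the other reply is the one reported to work in this thread.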
Re^5: Unexpected interaction between decode_entities() and lc() by choroba (Cardinal) on Nov 14, 2017 at 16:51 UTC
Update: Nonsense. Sorry. ~~You can use `binmode ARGV, ':encoding(UTF-8)';` to affect the encoding of the input coming through the diamond operator.~~