Re: Unexpected interaction between decode_entities() and lc()

It took a little editing, but I was able to reproduce this (Linux / Perl 5.26). Note that AFAIK PerlMonks defaults to the Latin1 character set encoding, so unlike the file you get from the [download] link, I am assuming that your source code file is encoded in valid UTF-8 and contains the correctly encoded characters "É" and "é". In my post below I am using <pre> instead of <code> tags (and a few other replacements) in order to get PerlMonks to display the post correctly. For posting on PerlMonks, I would personally leave out "use utf8;" and write \x{00C9} escapes instead, although if I do that with your code, none of the input strings are flagged as UTF8. (Update: With \N{U+00C9}, they do get flagged as UTF8.)

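It appears that HTML::Entities's decode_entities does not enable the UTF8 flag on the string it returns, so the all-ASCII input "&Eacute;dition limit&Eacute;e." decodes to a non-UTF8 string that never gets upgraded. lc() then applies byte semantics to that string and leaves the "É" (\311) unchanged, as the third pair of dumps in the output shows.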

#!/usr/bin/env perl
use warnings;
use strict;
use utf8;
use open qw/:std :utf8/;
use HTML::Entities;
use Devel::Peek;

my @strs = ("Édition limitÉe.",
	"Édition limit&Eacute;e.",
	"&Eacute;dition limit&Eacute;e.");
for my $str (@strs) {
	$str = decode_entities($str);
	Dump($str);
	$str = lc($str);
	Dump($str);
	print "{$str}\n";
}

__END__

# Output edited for brevity
SV = ... FLAGS = (POK,pPOK,UTF8)
  PV = 0xf667c0 "\303\211dition limit\303\211e."\0 [UTF8 "\x{c9}dition limit\x{c9}e."]
SV = ... FLAGS = (POK,pPOK,UTF8)
  PV = 0xf667c0 "\303\251dition limit\303\251e."\0 [UTF8 "\x{e9}dition limit\x{e9}e."]
{édition limitée.}
SV = ... FLAGS = (POK,pPOK,UTF8)
  PV = 0xffde20 "\303\211dition limit\303\211e."\0 [UTF8 "\x{c9}dition limit\x{c9}e."]
SV = ... FLAGS = (POK,pPOK,UTF8)
  PV = 0xffde20 "\303\251dition limit\303\251e."\0 [UTF8 "\x{e9}dition limit\x{e9}e."]
{édition limitée.}
SV = ... FLAGS = (POK,pPOK)
  PV = 0xfe3440 "\311dition limit\311e."\0
SV = ... FLAGS = (POK,pPOK)
  PV = 0xfe3440 "\311dition limit\311e."\0
{Édition limitÉe.}
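
The third case can be reproduced without HTML::Entities. The following is only a minimal sketch, assuming a Perl of 5.12 or later (for the unicode_strings feature); the variable names are mine and not from the original code. It builds a non-UTF8 string from \x escapes and compares lc() on it with lc() on an upgraded copy.

```perl
use strict;
use warnings;

# Built from \x escapes, this string is stored as native bytes (UTF8 flag
# off), like the result of decoding the all-entity input above.
my $native = "\x{C9}dition limit\x{C9}e.";

# Same characters, but stored internally as UTF-8 (UTF8 flag on).
my $upgraded = $native;
utf8::upgrade($upgraded);

# Without the unicode_strings feature, lc() uses byte semantics here and
# only lowercases ASCII letters, so \xC9 is left alone.
my $lc_native = lc $native;

# On the upgraded copy, lc() uses character semantics: \xC9 becomes \xE9.
my $lc_upgraded = lc $upgraded;

# Alternative: switch on character semantics lexically instead of upgrading.
my $lc_feature = do {
    use feature 'unicode_strings';
    lc $native;
};

printf "native string changed by lc():   %s\n", $lc_native   eq $native ? "no" : "yes";
printf "upgraded string changed by lc(): %s\n", $lc_upgraded eq $native ? "no" : "yes";
printf "with unicode_strings:            %s\n", $lc_feature  eq $native ? "no" : "yes";
```

Either workaround treats the symptom only; decoding the input properly is usually the cleaner fix.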

The "correct" way to solve this depends a bit on where the strings you are getting come from. Are they all embedded in your source code? Are you reading them from a file? If so, are you opening the files with the correct layers, for example with open my $fh, '<:encoding(UTF-8)', ...?
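
As an illustration of the file case, here is a rough sketch that writes a small UTF-8 file and reads it back through an encoding layer. The temporary file and the variable names are assumptions made for the example, not part of the original script.

```perl
use strict;
use warnings;
use File::Temp qw/tempfile/;

# Hypothetical input file: the UTF-8 bytes for "Édition limitée." plus newline.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh;    # write raw bytes, no layer
print {$tmp_fh} "\x{C3}\x{89}dition limit\x{C3}\x{A9}e.\n";
close $tmp_fh or die "Cannot close $path: $!";

# The layer decodes the bytes into characters as they are read.
open my $fh, '<:encoding(UTF-8)', $path or die "Cannot open $path: $!";
my $line = <$fh>;
close $fh;
chomp $line;

# The decoded string has character semantics, so lc() handles the É.
my $lower = lc $line;

printf "%d characters read, first code point after lc(): U+%04X\n",
    length $line, ord $lower;
```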

Update: In the code, changed all é to É and &eacute; to &Eacute; to make the effects clearer. (Also, I can't reproduce the dependence on the -w switch that 1nickt reported.)

Re^2: Unexpected interaction between decode_entities() and lc() by kurisuto (Novice) on Nov 14, 2017 at 15:23 UTC
My real intent is to read from a file, and that's where I first noticed the problem. I made the script above to test and illustrate it. You said you're assuming that my script source file is UTF-8 encoded, and yes, that's the case. The following tiny script reads from a file and has the same problem:

#!/usr/bin/perl5.16.3
use strict;
use HTML::Entities;
binmode STDIN, ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';
while(<>) {
	utf8::decode($_);
	chomp;
	$_ = decode_entities($_);
	$_ = lc($_);
	print $_, "\n";
}

Here are the contents of my test input file:

Édition limitée.
&Eacute;dition limit&eacute;e.

&Eacute; still ends up as É, not é as intended. I think I'm doing what I'm supposed to do in terms of enabling the UTF-8 flag on $_, but please let me know if I've missed something.
Re^3: Unexpected interaction between decode_entities() and lc() by haukex (Archbishop) on Nov 14, 2017 at 15:50 UTC
First, I would strongly recommend against using any of the functions from the utf8 module, since these are really only meant for use if one knows the inner workings of how Perl handles Unicode and non-Unicode strings. Second, it seems that calling binmode on `STDIN` is not enough. Instead, try the open pragma with the `:std` option (update: that also sets the default open modes, so it works on files passed on the command line as well). The following works for me, both for a UTF-8 file piped into `perl` and for a file listed on the command line:

use warnings;
use strict;
use open qw/:std :utf8/;
use HTML::Entities;
while(<>) {
	chomp;
	print lc(decode_entities($_)), "\n";
}

Update: Added a missing "not", oops :-)
Re^4: Unexpected interaction between decode_entities() and lc() by ikegami (Patriarch) on Nov 16, 2017 at 19:00 UTC
"First, I would strongly recommend against using any of the functions from the utf8 module, since these are really only meant for use if one knows the inner workings of how Perl handles Unicode and non-Unicode strings."

That's not true.

`utf8::encode` and `utf8::decode`: Safe. Used for efficient encoding and decoding.
`utf8::upgrade` and `utf8::downgrade`: Safe. Used for working around The Unicode Bug.
`utf8::is_utf8`: AVOID. Only useful when checking code for The Unicode Bug. Any use outside of debugging suffers from The Unicode Bug by definition. Use one of the previously named functions instead.
`utf8::valid`: Safe, but of extremely limited use (it checks whether scalars are well-formed).

It's actually Encode that has the subs you should avoid.

`Encode::encode` and `Encode::decode`: Safe. Used for encoding and decoding.
`Encode::is_utf8`: AVOID. Only useful when checking code for The Unicode Bug. Any use outside of debugging suffers from The Unicode Bug by definition.
`Encode::_utf8_on` and `Encode::_utf8_off`: UNSAFE. There is no reason to ever use these, so NEVER use them. Use one of the utf8:: functions instead. `Encode::_utf8_on($s)` is short for `utf8::decode($s) if !utf8::is_utf8($s);`, which suffers from The Unicode Bug by definition. `Encode::_utf8_off($s)` is short for `utf8::encode($s) if utf8::is_utf8($s);`, which suffers from The Unicode Bug by definition.

The Unicode Bug refers to code whose behaviour depends on the internal storage format of a string (i.e. the value returned by `utf8::is_utf8` and `Encode::is_utf8`).
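
To make the list above more concrete, here is a small sketch of `utf8::decode` and `utf8::encode` used in place on a scalar. The sample strings and variable names are made up for illustration, and `utf8::is_utf8` appears only as a debugging aid, in line with the advice above.

```perl
use strict;
use warnings;

# UTF-8 encoded bytes for "Édition" (É is the two bytes C3 89).
my $text = "\x{C3}\x{89}dition";
my $byte_len = length $text;              # counts bytes before decoding

# utf8::decode works in place and returns false for malformed UTF-8.
utf8::decode($text) or die "input is not valid UTF-8";
my $char_len = length $text;              # counts characters after decoding

# The decoded string contains a non-ASCII character, so lc() lowercases É.
my $lower = lc $text;

# utf8::encode converts characters back to UTF-8 bytes, also in place.
my $bytes = $lower;
utf8::encode($bytes);

# An ASCII-only line decodes to itself and its internal format is not
# changed, which is why a later decode_entities() can still hand back a
# non-upgraded string.
my $ascii = "&Eacute;dition";
utf8::decode($ascii);
my $ascii_flagged = utf8::is_utf8($ascii) ? 1 : 0;   # debugging only

print "bytes: $byte_len, characters: $char_len, ASCII line flagged: $ascii_flagged\n";
```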
Re^5: Unexpected interaction between decode_entities() and lc() by haukex (Archbishop) on Nov 16, 2017 at 19:46 UTC
Re^3: Unexpected interaction between decode_entities() and lc() by hippo (Archbishop) on Nov 14, 2017 at 15:44 UTC
The encoding layer is already handling that, so forget about the `utf8::decode($_);` line and it all just works:

$ cat uct.pl
#!/usr/bin/perl5.16.3
use strict;
use HTML::Entities;
binmode STDIN, ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';
while(<>) {
	chomp;
	$_ = decode_entities($_);
	$_ = lc($_);
	print $_, "\n";
}
$ echo -e "Édition limitée.\n&Eacute;dition limit&eacute;e." | perl uct.pl
édition limitée.
édition limitée.

Update: forgot to mention: this is on perl 5.20.3 regardless of your #! line.
Re^4: Unexpected interaction between decode_entities() and lc() by haukex (Archbishop) on Nov 14, 2017 at 15:53 UTC
Unfortunately that only works when piping data into Perl. It does not work if files are specified on the command line, since those are opened implicitly by `<>` and are not affected by `binmode STDIN` (see my other reply in this thread).
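
A rough sketch of the difference, assuming no `use open` pragma, `-C` switch or `PERL_UNICODE` setting is in effect. The temporary file and the `local @ARGV` assignment stand in for a file named on the command line and are not part of the scripts above.

```perl
use strict;
use warnings;
use File::Temp qw/tempfile/;

# Hypothetical input file holding the UTF-8 bytes for "Édition limitée."
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "\x{C3}\x{89}dition limit\x{C3}\x{A9}e.\n";
close $tmp_fh or die "Cannot close $path: $!";

# This sets a layer on STDIN only.
binmode STDIN, ':encoding(UTF-8)';

# Simulate "perl script.pl FILE": <> opens the file itself, without the layer.
my @raw_lines;
{
    local @ARGV = ($path);
    while (my $line = <>) {
        chomp $line;
        push @raw_lines, $line;
    }
}

# For comparison, open the same file explicitly with an encoding layer.
open my $fh, '<:encoding(UTF-8)', $path or die "Cannot open $path: $!";
chomp(my $decoded = <$fh>);
close $fh;

# The <> line still consists of undecoded bytes, so it is longer.
printf "length via <>: %d, via explicit open: %d\n",
    length $raw_lines[0], length $decoded;
```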
Re^5: Unexpected interaction between decode_entities() and lc() by choroba (Cardinal) on Nov 14, 2017 at 16:51 UTC