in reply to Unexpected interaction between decode_entities() and lc()

It took a little editing, but I was able to reproduce this (Linux, Perl 5.26). Note that AFAIK PerlMonks defaults to the Latin-1 character encoding, so, as opposed to the file you get from the [download] link, I am assuming that your source code file is encoded in valid UTF-8 and contains the correctly encoded characters "É" and "é". In my post below I am using <pre> instead of <code> tags (and a few other replacements) to get PerlMonks to display the post correctly. For posting on PerlMonks, I would personally leave out "use utf8;" and write \x{00C9} escapes instead, although if I do that with your code, none of the input strings are flagged as UTF-8. (Update: With \N{U+00C9}, they do get flagged as UTF8.)

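It appears that HTML::Entities's decode_entities does not enable the UTF8 flag on the string, so the non-UTF8 string "&Eacute;dition limit&Eacute;e." does not get upgraded. With no locale or unicode_strings feature in effect, lc() then applies only ASCII case rules to that string, which is why the É is left alone in the third case below.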

#!/usr/bin/env perl
use warnings;
use strict;
use utf8;
use open qw/:std :utf8/;
use HTML::Entities;
use Devel::Peek;

my @strs = ("Édition limitÉe.",
	"Édition limit&Eacute;e.",
	"&Eacute;dition limit&Eacute;e.");
for my $str (@strs) {
	$str = decode_entities($str);
	Dump($str);
	$str = lc($str);
	Dump($str);
	print "{$str}\n";
}

__END__

# Output edited for brevity
SV = ... FLAGS = (POK,pPOK,UTF8)
  PV = 0xf667c0 "\303\211dition limit\303\211e."\0 [UTF8 "\x{c9}dition limit\x{c9}e."]
SV = ... FLAGS = (POK,pPOK,UTF8)
  PV = 0xf667c0 "\303\251dition limit\303\251e."\0 [UTF8 "\x{e9}dition limit\x{e9}e."]
{édition limitée.}
SV = ... FLAGS = (POK,pPOK,UTF8)
  PV = 0xffde20 "\303\211dition limit\303\211e."\0 [UTF8 "\x{c9}dition limit\x{c9}e."]
SV = ... FLAGS = (POK,pPOK,UTF8)
  PV = 0xffde20 "\303\251dition limit\303\251e."\0 [UTF8 "\x{e9}dition limit\x{e9}e."]
{édition limitée.}
SV = ... FLAGS = (POK,pPOK)
  PV = 0xfe3440 "\311dition limit\311e."\0
SV = ... FLAGS = (POK,pPOK)
  PV = 0xfe3440 "\311dition limit\311e."\0
{Édition limitÉe.}

The "correct" way to solve this depends a bit on where the strings you are getting come from. Are they all embedded in your source code? Are you reading them from a file? If so, are you opening the files with the correct layers, for example with open my $fh, '<:encoding(UTF-8)', ...?
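As a rough illustration of why the layer matters, here is a minimal sketch (not taken from the original code; the scratch file and all variable names are made up for the example). It writes the UTF-8 bytes for "ÉDITION" to a temporary file, then reads the line back once as raw bytes and once through the :encoding(UTF-8) layer, and applies lc() to both.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical test data: the UTF-8 bytes for "ÉDITION\n" in a scratch file.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "\xC3\x89DITION\n";
close $tmp_fh or die "close: $!";

# Read as raw bytes: "É" arrives as two bytes that lc() does not touch.
open my $raw, '<:raw', $path or die "open: $!";
chomp(my $bytes = <$raw>);
close $raw;
my $lc_bytes = lc $bytes;

# Read through the decoding layer: "É" is one character and lc() maps it to "é".
open my $fh, '<:encoding(UTF-8)', $path or die "open: $!";
chomp(my $chars = <$fh>);
close $fh;
my $lc_chars = lc $chars;

printf "raw:     %d units, first is 0x%02X after lc\n", length $lc_bytes, ord $lc_bytes;
printf "decoded: %d units, first is 0x%02X after lc\n", length $lc_chars, ord $lc_chars;
```

The raw read yields 8 bytes whose first byte is still 0xC3 after lc(), while the decoded read yields 7 characters whose first character has become 0xE9 (é).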

Update: In the code, changed all é to É and &eacute; to &Eacute; to make the effects more clear. (Also, I can't reproduce the dependence on the -w switch that 1nickt reported.)

Re^2: Unexpected interaction between decode_entities() and lc()
by kurisuto (Novice) on Nov 14, 2017 at 15:23 UTC

    My real intent is to read from a file, and that's where I first noticed the problem. I made the script above to test and illustrate the problem. You said you're assuming that my script source file is UTF-8 encoded, and yes, that's the case.

    Following is a tiny script which reads from a file and which has the same problem:

    #!/usr/bin/perl5.16.3

    use strict;
    use HTML::Entities;

    binmode STDIN, ':encoding(UTF-8)';
    binmode STDOUT, ':encoding(UTF-8)';

    while(<>) {
        utf8::decode($_);
        chomp;

        $_ = decode_entities($_);
        $_ = lc($_);

        print $_, "\n";
    }

    Here are the contents of my test input file:

    Édition limitée.
    &Eacute;dition limitée.
    

    &Eacute; still ends up as É, not é as intended. I think I'm doing what I'm supposed to do in terms of enabling the UTF-8 flag on $_, but please let me know if I've missed something.

      First, I would strongly recommend against using any of the functions from the utf8 module, since these are really only meant for use if one knows the inner workings of how Perl handles Unicode and non-Unicode strings. Second, it seems that calling binmode on STDIN is not enough. Instead, try the open pragma with the :std option (update: it also sets the default open layers, so it works on files passed on the command line as well). The following works for me, both for a UTF-8 file piped into perl and for a file listed on the command line:

      use warnings;
      use strict;
      use open qw/:std :utf8/;
      use HTML::Entities;

      while(<>) {
          chomp;
          print lc(decode_entities($_)), "\n";
      }

      Update: Added a missing "not", oops :-)

        First, I would strongly recommend against using any of the functions from the utf8 module, since these are really only meant for use if one knows the inner workings of how Perl handles Unicode and non-Unicode strings.

        That's not true.

        utf8::encode and utf8::decode
        Safe. Used for efficient encoding and decoding.
        utf8::upgrade and utf8::downgrade
        Safe. Used for working around The Unicode Bug.
        utf8::is_utf8
        AVOID. Only useful when checking code for The Unicode Bug. Any use outside of debugging suffers from The Unicode Bug by definition. Use one of the previously-named functions instead.
        utf8::valid
        Safe, but extremely limited use (as it checks if scalars are well-formed).

        It's actually Encode that has the subs you should avoid.

        Encode::encode and Encode::decode
        Safe. Used for encoding and decoding.
        Encode::is_utf8
        AVOID. Only useful when checking code for The Unicode Bug. Any use outside of debugging suffers from The Unicode Bug by definition.
        Encode::_utf8_on and Encode::_utf8_off
        UNSAFE. There is no reason to ever use these, so NEVER use these. Use one of the utf8:: functions instead.
        Encode::_utf8_on($s) is short for utf8::decode($s) if !utf8::is_utf8($s);, which suffers from The Unicode Bug by definition.
        Encode::_utf8_off($s) is short for utf8::encode($s) if utf8::is_utf8($s);, which suffers from The Unicode Bug by definition.

        The Unicode Bug refers to code whose behaviour depends on the internal storage format of a string (i.e. the value returned by utf8::is_utf8 and Encode::is_utf8).
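To make that definition concrete, here is a minimal sketch (not from the posts above; the variable names are made up for illustration). It shows lc() giving different results for the same characters depending on the internal storage format, and utf8::upgrade working around that. It assumes that no locale and no unicode_strings feature is in effect, so it deliberately avoids "use v5.12" or later.

```perl
use strict;
use warnings;
# Deliberately no "use feature 'unicode_strings'" and no "use locale".

# Hypothetical sample: "ÉDITION" with É stored as the single byte 0xC9.
my $downgraded = "\xC9DITION";
my $upgraded   = $downgraded;
utf8::upgrade($upgraded);    # same characters, UTF-8 internal storage

# The two strings compare equal because they hold the same characters...
my $same_string = $downgraded eq $upgraded;

# ...but lc() treats them differently, which is the Unicode Bug.
my $lc_downgraded = lc $downgraded;    # ASCII rules only: 0xC9 is left alone
my $lc_upgraded   = lc $upgraded;      # Unicode rules: 0xC9 becomes 0xE9

printf "equal before lc: %s\n", $same_string ? "yes" : "no";
printf "lc(downgraded) starts with 0x%02X\n", ord $lc_downgraded;
printf "lc(upgraded)   starts with 0x%02X\n", ord $lc_upgraded;
```

The strings are equal before lc(), yet the downgraded one still starts with 0xC9 (É) afterwards while the upgraded one starts with 0xE9 (é). This is the same effect seen with the output of decode_entities earlier in the thread.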

      The encoding layer is already handling that, so forget about the utf8::decode($_); line and it all just works:

      $ cat uct.pl 
      #!/usr/bin/perl5.16.3                                                                                            
      
      use strict;
      use HTML::Entities;
      
      binmode STDIN, ':encoding(UTF-8)';
      binmode STDOUT, ':encoding(UTF-8)';
      
      while(<>) {
          chomp;
      
          $_ = decode_entities($_);
          $_ = lc($_);
      
          print $_, "\n";
      }
      $ echo -e "Édition limitée.\n&Eacute;dition limitée." | perl uct.pl 
      édition limitée.
      édition limitée.
      

      Update: forgot to mention: this is on perl 5.20.3 regardless of your #! line.

        Unfortunately that only works if piping stuff into Perl. It does not work if files are specified on the command line, since those are opened implicitly by <> (via ARGV) and are not affected by binmode STDIN (see my post here).