kurisuto has asked for the wisdom of the Perl Monks concerning the following question:

Here's my code:
#!/usr/bin/perl5.16.3 -w

use strict;
use utf8;
use HTML::Entities;

binmode STDOUT, ':encoding(UTF-8)';

my $string1 = "Édition limitée.";
$string1 = decode_entities($string1);
$string1 = lc($string1);
print $string1, "\n";    # Yields "édition limitée."

my $string2 = "&Eacute;dition limit&eacute;e.";
$string2 = decode_entities($string2);
$string2 = lc($string2);
print $string2, "\n";    # Yields "Édition limitée."

For some reason, the "É" character is lowercased in the first example but not the second. The difference in the second case is that the character originated as the "&Eacute;" entity.

I searched around on the web and could not find an explanation. Thanks for any help.

Re: Unexpected interaction between decode_entities() and lc()
by haukex (Archbishop) on Nov 14, 2017 at 13:55 UTC

    It took a little editing but I was able to reproduce this (Linux / Perl 5.26). Note that AFAIK PerlMonks defaults to the Latin1 character set encoding, so as opposed to the file you get from the [download] link, I am assuming that your source code file is correctly encoded in valid UTF-8 and contains the correctly encoded characters "É" and "é". In my post below I am using the <pre> instead of <code> tags (and a few other replacements) in order to get PerlMonks to display the post correctly. For posting on PerlMonks, personally I would not use utf8; and use \x{00C9} escapes instead, although then again, if I do that with your code, none of the input strings are flagged as UTF-8. (Update: With \N{U+00C9}, they do get flagged as UTF8.)

    It appears that HTML::Entities's decode_entities does not enable the UTF8 flag on the string, so the non-UTF8 string "&Eacute;dition limit&eacute;e." does not get upgraded, and lc then leaves the non-ASCII "É" untouched.

    #!/usr/bin/env perl
    use warnings;
    use strict;
    use utf8;
    use open qw/:std :utf8/;
    use HTML::Entities;
    use Devel::Peek;
    
    my @strs = ("Édition limitÉe.",
    	"Édition limit&Eacute;e.",
    	"&Eacute;dition limit&Eacute;e.");
    for my $str (@strs) {
    	$str = decode_entities($str);
    	Dump($str);
    	$str = lc($str);
    	Dump($str);
    	print "{$str}\n";
    }
    
    __END__
    
    # Output edited for brevity
    SV = ... FLAGS = (POK,pPOK,UTF8)
      PV = 0xf667c0 "\303\211dition limit\303\211e."\0 [UTF8 "\x{c9}dition limit\x{c9}e."]
    SV = ... FLAGS = (POK,pPOK,UTF8)
      PV = 0xf667c0 "\303\251dition limit\303\251e."\0 [UTF8 "\x{e9}dition limit\x{e9}e."]
    {édition limitée.}
    SV = ... FLAGS = (POK,pPOK,UTF8)
      PV = 0xffde20 "\303\211dition limit\303\211e."\0 [UTF8 "\x{c9}dition limit\x{c9}e."]
    SV = ... FLAGS = (POK,pPOK,UTF8)
      PV = 0xffde20 "\303\251dition limit\303\251e."\0 [UTF8 "\x{e9}dition limit\x{e9}e."]
    {édition limitée.}
    SV = ... FLAGS = (POK,pPOK)
      PV = 0xfe3440 "\311dition limit\311e."\0
    SV = ... FLAGS = (POK,pPOK)
      PV = 0xfe3440 "\311dition limit\311e."\0
    {Édition limitÉe.}
    

    The "correct" way to solve this depends a bit on where the strings you are getting come from. Are they all embedded in your source code? Are you reading them from a file? If so, are you opening the files with the correct layers, for example, open my $fh, '<:encoding(UTF-8)', ...?
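
    A minimal sketch of the layers question above, assuming the input is UTF-8. To stay self-contained it reads from an in-memory filehandle rather than a real file; the helper name read_lines and the sample bytes are made up for illustration.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: read lines from UTF-8 encoded bytes through a
    # decoding layer. An in-memory filehandle stands in for a real file.
    sub read_lines {
        my ($bytes) = @_;
        open my $fh, '<:encoding(UTF-8)', \$bytes or die "open failed: $!";
        chomp(my @lines = <$fh>);
        close $fh;
        return @lines;
    }

    # "\xC3\x89" is the UTF-8 encoding of U+00C9 (É).
    my @lines = read_lines("\xC3\x89dition\n");
    printf "%d characters, UTF8 flag %s\n",
        length $lines[0], utf8::is_utf8($lines[0]) ? 'on' : 'off';
    ```

    With a real file, the same three-argument open takes the filename in place of the scalar reference.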

    Update: In the code, changed all é to É and &eacute; to &Eacute; to make the effects more clear. (Also, I can't reproduce the dependence on the -w switch that 1nickt reported.)

      My real intent is to read from a file, and that's where I first noticed the problem. I made the script above to test and illustrate the problem. You said you're assuming that my script source file is UTF-8 encoded, and yes, that's the case.

      Following is a tiny script which reads from a file and which has the same problem:

      #!/usr/bin/perl5.16.3

      use strict;
      use HTML::Entities;

      binmode STDIN, ':encoding(UTF-8)';
      binmode STDOUT, ':encoding(UTF-8)';

      while(<>) {
          utf8::decode($_);
          chomp;

          $_ = decode_entities($_);
          $_ = lc($_);

          print $_, "\n";
      }

      Here are the contents of my test input file:

      Édition limitée.
      &Eacute;dition limitée.
      

      &Eacute; still ends up as É, not é as intended. I think I'm doing what I'm supposed to do in terms of enabling the UTF-8 flag on $_, but please let me know if I've missed something.

        First, I would strongly recommend against using any of the functions from the utf8 module, since these are really only meant for use if one knows the inner workings of how Perl handles Unicode and non-Unicode strings. Second, it seems that calling binmode on STDIN is not enough. Instead, try the open pragma with the :std option (update: since that also sets the default open modes and works on files passed on the command line as well). The following works for me, both for a UTF-8 file piped into perl and a file listed on the command line:

        use warnings;
        use strict;
        use open qw/:std :utf8/;
        use HTML::Entities;

        while(<>) {
            chomp;
            print lc(decode_entities($_)), "\n";
        }

        Update: Added a missing "not", oops :-)

        The encoding layer is already handling that, so forget about the utf8::decode($_); line and it all just works:

        $ cat uct.pl 
        #!/usr/bin/perl5.16.3                                                                                            
        
        use strict;
        use HTML::Entities;
        
        binmode STDIN, ':encoding(UTF-8)';
        binmode STDOUT, ':encoding(UTF-8)';
        
        while(<>) {
            chomp;
        
            $_ = decode_entities($_);
            $_ = lc($_);
        
            print $_, "\n";
        }
        $ echo -e "Édition limitée.\n&Eacute;dition limitée." | perl uct.pl 
        édition limitée.
        édition limitée.
        

        Update: forgot to mention: this is on perl 5.20.3 regardless of your #! line.

Re: Unexpected interaction between decode_entities() and lc()
by 1nickt (Canon) on Nov 14, 2017 at 13:35 UTC

    Update: removing the -w flag from your shebang line fixes the problem.

    I cannot say what issue it triggers, but in general you should not use -w in your program, as it turns on all warnings in all code pulled into it. To limit warnings to your own code, put use warnings; in your script instead.

    Hope this helps!

    Original reply below:

    Hi, I can't reproduce your results on my system.

    $ perl 1203368.pl
    édition limitée.
    édition limitée.
    This is perl 5, version 26, subversion 1 (v5.26.1) built for x86_64-linux


    The way forward always starts with a minimal test.

      I tried removing -w just now, and got the same output:

      édition limitée.
      Édition limitée.

      I'm not sure why -w would affect the behavior of decode_entities() or lc() (other than maybe printing warnings). I wasn't getting any warnings one way or the other; just unexpected output.

        I'm not sure either. But strange things like that do happen.

        More info: What's wrong with -w and $^W from the Perl documentation.


Re: Unexpected interaction between decode_entities() and lc()
by vr (Curate) on Nov 14, 2017 at 16:48 UTC

    I think there's a bit missing from the previous answers: lc is simply following its documentation:

    If use bytes is in effect: The results follow ASCII rules. Only the characters A-Z change, to a-z respectively.

    (Emphasis mine.) Since neither "use feature 'unicode_strings';" nor e.g. "use 5.016;" was declared, lc does exactly as described above. BTW, I'm impressed by the clever behavior of decode_entities, i.e. its output depends on the UTF8 flag of its argument.

      use feature 'unicode_strings'; is an excellent point!

      Since neither "use feature 'unicode_strings';", nor e.g. "use 5.016;" was declared, then lc does exactly as described above.

      If by "described above" you mean "Only the characters A-Z change, to a-z respectively.", then I think your reading of the lc docs might be a little off: my understanding is that bytes is not the default behavior. The following test took a little fiddling to get the right values, but it passes on Perl 5.8.1, 5.8.9, 5.10.1, and later releases up to 5.26, and shows the differences:

      use warnings;
      use strict;
      use utf8;
      use Test::More;
      
      diag explain "Perl $]";
      plan tests => $] ge '5.012' ? 15 : 11;
      SKIP: {
      	is "\N{U+00C9}", "É",           '\N{U+...} escape';
      	skip 'Perl ge 5.12 required', 1 unless $] ge '5.012';
      	ok utf8::is_utf8("\N{U+00C9}"), '\N{U+...} sets UTF8';
      }
      {
      	ok !utf8::is_utf8("\x{C9}"),    '\x doesn\'t set UTF8';
      	is lc("\x{C9}"), "\xC9",        'lc on non-UTF8 str';
      	ok utf8::is_utf8("É"),          'str is UTF8';
      	is lc("É"), "é",                'lc on UTF8 str';
      }
      {
      	use bytes;
      	ok !utf8::is_utf8("\x{C9}"),    'bytes: \x doesn\'t set UTF8';
      	is lc("\x{C9}"), "\xC9",        'bytes: lc on non-UTF8 str';
      	ok utf8::is_utf8("É"),          'bytes: str is UTF8';
      	is lc("É"), $] lt '5.008009' ? "\xC9" : "\xC3\x89",
      	                                'bytes: lc on UTF8 str';
      }
      SKIP: { skip 'Perl ge 5.12 required', 1 unless $] ge '5.012';
      ok eval q{ do {
      	use feature 'unicode_strings';
      	ok !utf8::is_utf8("\x{C9}"),    'u_s: \x doesn\'t set UTF8';
      	is lc("\x{C9}"), "é",           'u_s: lc on non-UTF8 str';
      	ok utf8::is_utf8("É"),          'u_s: str is UTF8';
      	is lc("É"), "é",                'u_s: lc on UTF8 str';
      1 } }, 'unicode_strings works' or warn $@ }
      

        You are right, I was too quick to paste a quote from lc documentation page, it should have been the last, fall-through case.

        Also, kurisuto, my comment was not a solution but rather an attempt to explain (to myself) what was happening; I too rarely deal with strings that are extended-ASCII and yet not UTF-8. The proper fix (at least for anything but one-time scripts) would be to always explicitly decode inputs from all sources, rather than hoping they are Latin-1 only and that Perl silently does "the right thing" in the background.
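
        As a rough illustration of explicit decoding, here is a sketch using the core Encode module, assuming the input bytes are UTF-8. The helper name decode_input and the sample bytes are hypothetical, not taken from the thread.

        ```perl
        use strict;
        use warnings;
        use Encode qw(decode);

        # Hypothetical helper: turn raw UTF-8 bytes into a Perl character string.
        # FB_CROAK makes malformed input die instead of being silently replaced.
        sub decode_input {
            my ($bytes) = @_;
            return decode('UTF-8', $bytes, Encode::FB_CROAK);
        }

        # UTF-8 bytes for "Édition limitée."
        my $raw   = "\xC3\x89dition limit\xC3\xA9e.";
        my $text  = decode_input($raw);
        my $lower = lc $text;    # the decoded string gets Unicode case rules
        ```

        Decoding at the point of input means later calls such as lc see characters rather than bytes, whatever the string happened to contain.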

      Aha! Adding "use feature 'unicode_strings';" fixed the problem. Thank you!
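
      For reference, a small sketch of what the feature changes, using a one-byte "\xC9" string that is not flagged as UTF8. The variable names are illustrative only.

      ```perl
      use strict;
      use warnings;

      # "\xC9" is É stored as a single byte, so the string is not flagged as UTF8.
      my $str = "\xC9dition";

      # Without the feature, lc applies ASCII rules to a non-UTF8 string.
      my $plain = lc $str;

      my $fixed;
      {
          # Lexically scoped; requires Perl 5.12 or later.
          use feature 'unicode_strings';
          $fixed = lc $str;
      }

      printf "without feature: first char is U+%04X\n", ord $plain;
      printf "with feature:    first char is U+%04X\n", ord $fixed;
      ```

      In line with the test results posted earlier in the thread, the first line should report U+00C9 and the second U+00E9.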