in reply to Re: Unexpected interaction between decode_entities() and lc()
in thread Unexpected interaction between decode_entities() and lc()

My real intent is to read from a file, and that's where I first noticed the problem. I made the script above to test and illustrate the problem. You said you're assuming that my script source file is UTF-8 encoded, and yes, that's the case.

Following is a tiny script which reads from a file and which has the same problem:

#!/usr/bin/perl5.16.3

use strict;
use HTML::Entities;

binmode STDIN, ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';

while(<>) {
    utf8::decode($_);
    chomp;

    $_ = decode_entities($_);
    $_ = lc($_);

    print $_, "\n";
}

Here are the contents of my test input file:

Édition limitée.
&Eacute;dition limitée.

&Eacute; still ends up as É, not é as intended. I think I'm doing what I'm supposed to do in terms of enabling the UTF-8 flag on $_, but please let me know if I've missed something.

Re^3: Unexpected interaction between decode_entities() and lc()
by haukex (Archbishop) on Nov 14, 2017 at 15:50 UTC

    First, I would strongly recommend against using any of the functions from the utf8 module, since these are really only meant for use if one knows the inner workings of how Perl handles Unicode and non-Unicode strings. Second, it seems that calling binmode on STDIN is not enough. Instead, try the open pragma with the :std option (update: since that also sets the default open modes, it works on files passed on the command line as well). The following works for me, both for a UTF-8 file piped into perl and for a file listed on the command line:

    use warnings;
    use strict;
    use open qw/:std :utf8/;
    use HTML::Entities;

    while(<>) {
        chomp;
        print lc(decode_entities($_)), "\n";
    }

    Update: Added a missing "not", oops :-)

      First, I would strongly recommend against using any of the functions from the utf8 module, since these are really only meant for use if one knows the inner workings of how Perl handles Unicode and non-Unicode strings.

      That's not true.

      utf8::encode and utf8::decode
      Safe. Used for efficient encoding and decoding.
      utf8::upgrade and utf8::downgrade
      Safe. Used for working around The Unicode Bug.
      utf8::is_utf8
      AVOID. Only useful when checking code for The Unicode Bug. Any use outside of debugging suffers from The Unicode Bug by definition. Use one of the previously-named functions instead.
      utf8::valid
      Safe, but extremely limited use (as it checks if scalars are well-formed).

      It's actually Encode that has the subs you should avoid.

      Encode::encode and Encode::decode
      Safe. Used for encoding and decoding.
      Encode::is_utf8
      AVOID. Only useful when checking code for The Unicode Bug. Any use outside of debugging suffers from The Unicode Bug by definition.
      Encode::_utf8_on and Encode::_utf8_off
      UNSAFE. There is no reason to ever use these, so NEVER use these. Use one of the utf8:: functions instead.
      Encode::_utf8_on($s) is short for utf8::decode($s) if !utf8::is_utf8($s);, which suffers from The Unicode Bug by definition.
      Encode::_utf8_off($s) is short for utf8::encode($s) if utf8::is_utf8($s);, which suffers from The Unicode Bug by definition.
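As a small illustration of the two decoding routes listed as safe above, the following sketch decodes the same UTF-8 bytes once with Encode::decode and once with utf8::decode, then compares the results. The sample string and all variable names are invented for this example and do not come from the original script.

```perl
use strict;
use warnings;
use Encode ();

# "\xC3\x89" is the two-byte UTF-8 encoding of U+00C9 (É).
my $bytes = "\xC3\x89dition";

# Encode::decode returns a new, decoded string.
# A copy is passed because Encode may modify its input when CHECK is set.
my $via_encode = Encode::decode('UTF-8', my $copy = $bytes, Encode::FB_CROAK());

# utf8::decode works in place and returns false if the bytes were not valid UTF-8.
my $via_utf8 = $bytes;
utf8::decode($via_utf8) or die "input was not valid UTF-8";

# Both routes yield the same character string.
my $same     = $via_encode eq $via_utf8;
my $first_cp = sprintf 'U+%04X', ord $via_utf8;

# utf8::encode is the in-place inverse: it turns the characters back into bytes.
my $roundtrip = $via_utf8;
utf8::encode($roundtrip);
```

The practical difference is mostly one of interface: utf8::decode modifies its argument and handles UTF-8 only, while Encode::decode returns a new string and accepts other encodings as well.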

      The Unicode Bug refers to code whose behaviour depends on the internal storage format of a string (i.e. the value returned by utf8::is_utf8 and Encode::is_utf8).
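The definition above can be made concrete with lc, the function this thread is about. The sketch below assumes a Perl without "use feature 'unicode_strings'" or "use locale" in effect for the first part; the variable names are invented for illustration. It shows two strings that compare equal yet lowercase differently, purely because of how they are stored internally.

```perl
use strict;
use warnings;
# Deliberately no "use feature 'unicode_strings'" and no "use locale" here:
# this is the setting in which the bug is visible.

my $downgraded = "\xC9";        # U+00C9 (É), stored as a single byte
my $upgraded   = "\xC9";
utf8::upgrade($upgraded);       # same character, stored internally as UTF-8

# To Perl these are the same string.
my $equal_before = $downgraded eq $upgraded;

my $lc_down = lc $downgraded;   # unchanged: only ASCII letters are case-mapped
my $lc_up   = lc $upgraded;     # Unicode case mapping applies, giving U+00E9 (é)

# The results now differ although the inputs were equal.
my $equal_after = $lc_down eq $lc_up;

# Opting in to Unicode semantics removes the dependence on the storage format.
my $lc_fixed = do {
    use feature 'unicode_strings';
    lc $downgraded;
};
```

Calling utf8::upgrade on a string before lc, or enabling the unicode_strings feature (available from Perl 5.12), are two ways to get the same result for both storage formats.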

        Thanks for clarifying! I knew about utf8::is_utf8 and utf8::valid and I guess I extrapolated too much.

Re^3: Unexpected interaction between decode_entities() and lc()
by hippo (Archbishop) on Nov 14, 2017 at 15:44 UTC

    The encoding layer is already handling that, so forget about the utf8::decode($_); line and it all just works:

    $ cat uct.pl 
    #!/usr/bin/perl5.16.3                                                                                            
    
    use strict;
    use HTML::Entities;
    
    binmode STDIN, ':encoding(UTF-8)';
    binmode STDOUT, ':encoding(UTF-8)';
    
    while(<>) {
        chomp;
    
        $_ = decode_entities($_);
        $_ = lc($_);
    
        print $_, "\n";
    }
    $ echo -e "Édition limitée.\n&Eacute;dition limitée." | perl uct.pl 
    édition limitée.
    édition limitée.
    

    Update: forgot to mention: this is on perl 5.20.3 regardless of your #! line.

      Unfortunately that only works if piping stuff into Perl. It does not work if files are specified on the command line, since the diamond operator opens those separately and they are not affected by binmode STDIN (see my post here).
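One way to sidestep the question of which handle the diamond operator reads from is to open each named file explicitly with a decoding layer. This is only a sketch under assumptions: lowercase_lines is a hypothetical helper, the entity decoding step from the thread is left as a comment so that the example needs no non-core modules, and reading from STDIN when no files are given is not handled.

```perl
use strict;
use warnings;

# Hypothetical helper: read already-decoded lines from a handle
# and return them lowercased.
sub lowercase_lines {
    my ($in) = @_;
    my @out;
    while (my $line = <$in>) {
        chomp $line;
        # In the thread's script, decode_entities($line) from HTML::Entities
        # would be applied here, before lc.
        push @out, lc $line;
    }
    return @out;
}

binmode STDOUT, ':encoding(UTF-8)';

# Open every file named on the command line with an explicit decoding layer,
# so the result does not depend on binmode STDIN.
for my $path (@ARGV) {
    open my $fh, '<:encoding(UTF-8)', $path
        or die "Cannot open $path: $!";
    print "$_\n" for lowercase_lines($fh);
    close $fh;
}
```

Because the layer is attached at open time, every line arrives as a decoded character string, and lc applies Unicode case mapping to it without any call to utf8::decode.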

        Update: Nonsense. Sorry.

        You can use

        binmode ARGV, ':encoding(UTF-8)';

        to affect the encoding of the input coming through the diamond operator.
