in reply to Re^3: Seeking Perl docs about how UTF8 flag propagates
in thread Seeking Perl docs about how UTF8 flag propagates

The function length() always reports the number of characters in the string: you do not need to know whether the UTF8 flag is set on your string to understand what it will do.

The function lc(), on the other hand, will give different results for the same string (i.e. a string consisting of the same characters) depending on whether the UTF8 flag is set or not. As such, it is an example of the Unicode bug in action.

This is not necessarily wrong - it is after all documented behaviour. But it does mean that, despite the intent, the programmer needs to know how the UTF8 flag will have been set to correctly predict the behaviour of lc() on strings containing certain characters.
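
To make the contrast concrete, here is a minimal sketch, assuming a Perl of 5.12 or later with no locale in effect. The character "\x{C0}" and the variable names are my own choices for illustration. It builds the same one-character string in both internal representations and compares lc() on each, first without and then with the unicode_strings feature.

```perl
use strict;
use warnings;

# "\x{C0}" is LATIN CAPITAL LETTER A WITH GRAVE; its lowercase is "\x{E0}".
my $downgraded = "\x{C0}";
utf8::downgrade($downgraded);    # stored as one byte, UTF8 flag off
my $upgraded = "\x{C0}";
utf8::upgrade($upgraded);        # stored as two bytes, UTF8 flag on

# Both variables hold the same sequence of characters.
my $same_string = ($downgraded eq $upgraded) ? 1 : 0;
my $same_length = (length($downgraded) == length($upgraded)) ? 1 : 0;

# Without 'unicode_strings', lc() uses ASCII rules on the flag-off copy
# and Unicode rules on the flag-on copy.
my $same_lc_default = (lc($downgraded) eq lc($upgraded)) ? 1 : 0;

my $same_lc_unicode;
{
    use feature 'unicode_strings';    # lexically scoped to this block
    $same_lc_unicode = (lc($downgraded) eq lc($upgraded)) ? 1 : 0;
}

printf "same string: %d, same length: %d\n", $same_string, $same_length;
printf "same lc (default): %d, same lc (unicode_strings): %d\n",
    $same_lc_default, $same_lc_unicode;
```

If this behaves as documented in perlunicode, only the default lc() comparison should come out false.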

Re^5: Seeking Perl docs about how UTF8 flag propagates
by LanX (Saint) on May 16, 2023 at 09:48 UTC
    I'm puzzled.

    Of course length can also give you different results if you change the UTF8 flag.

    Cheers Rolf
    (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)

      I believe you are thinking in terms of particular encodings, which knowledge of the Unicode bug wrongly tempts one to do. I'm not talking about taking an internal representation as a sequence of bytes and then flipping the UTF8 flag on that internal representation; I'm talking about the actual strings represented by the internal representation.

      A string in Perl is a sequence of characters, not the sequence of bytes (or octets) that represents those characters in a particular encoding. If length() gives a different answer on two strings, then they are not the same sequence of characters.

      In a Unicode world there is one string "fu\x{df}" consisting of three characters. Internally Perl might encode that in one of two different ways, resulting in different byte sequences and a different setting of the UTF8 flag, but it is the same string whichever encoding is used.

      So in the code below, I would expect verify_upgraded_length and verify_downgraded_length to return a TRUE value for every string input (if they return at all).

      use utf8 ();

      sub verify_upgraded_length {
          my($s) = @_;
          # wrong # my $u = utf8::upgrade($s);
          my $u = $s;
          utf8::upgrade($u);
          return length($s) == length($u);
      }

      sub verify_downgraded_length {
          my($s) = @_;
          # wrong # my $d = utf8::downgrade($s);  # dies if downgrade not possible
          my $d = $s;
          utf8::downgrade($d);  # dies if downgrade not possible
          return length($s) == length($d);
      }

      [Updated: corrected code, thanks haj++]
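
      As a minimal sketch of the "one string, two encodings" point, the following inspects both internal representations of "fu\x{df}". The %report hash and the variable names are hypothetical, chosen only for this illustration; bytes::length is used purely to peek at the internal byte count.

      ```perl
      use strict;
      use warnings;
      use bytes ();    # loads bytes::length without enabling byte semantics

      my $down = "fu\x{df}";
      utf8::downgrade($down);    # internal representation: one byte per character
      my $up = "fu\x{df}";
      utf8::upgrade($up);        # internal representation: UTF-8

      my %report = (
          flag_down  => utf8::is_utf8($down) ? 1 : 0,
          flag_up    => utf8::is_utf8($up)   ? 1 : 0,
          chars_down => length($down),           # characters, not bytes
          chars_up   => length($up),
          bytes_down => bytes::length($down),    # internal bytes
          bytes_up   => bytes::length($up),
          equal      => ($down eq $up) ? 1 : 0,  # same string either way
      );
      printf "%-10s %s\n", $_, $report{$_} for sort keys %report;
      ```

      Only the flag and the internal byte count should differ; the character count and the eq comparison do not depend on the representation.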

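        Nitpicking (again): utf8::upgrade and utf8::downgrade do not return converted strings. They make the change in place and return a status value instead: utf8::upgrade returns the number of octets needed to represent the string as UTF-8, and utf8::downgrade returns true on success.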

        One solution is to take a copy (also note that for current Perls, use utf8 isn't needed):

        sub verify_upgraded_length {
            my($s) = @_;
            my $u = $s;
            utf8::upgrade($u);
            return length($s) == length($u);
        }

        sub verify_downgraded_length {
            my($s) = @_;
            my $d = $s;
            utf8::downgrade($d);  # dies if downgrade not possible
            return length($s) == length($d);
        }
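
        For illustration, here is a small sketch of what these functions actually return, assuming the behaviour documented in perldoc utf8. The variable names and the sample strings are my own. It also uses the optional second argument of utf8::downgrade, which makes it return false instead of dying when the string contains a character above 0xFF.

        ```perl
        use strict;
        use warnings;

        my $s = "fu\x{df}";
        my $ret_up       = utf8::upgrade($s);    # octets in the UTF-8 form, not a string
        my $len_after_up = length($s);           # still counted in characters

        my $ret_down = utf8::downgrade($s);      # true on success; $s changed in place

        my $wide     = "\x{263A}";               # cannot be stored one byte per character
        my $ret_fail = utf8::downgrade($wide, 1);    # second argument: fail quietly
        my $len_wide = length($wide);            # string left as it was

        printf "upgrade returned %d, length is %d\n", $ret_up, $len_after_up;
        printf "downgrade ok: %s, downgrade of wide string ok: %s\n",
            ($ret_down ? "yes" : "no"), ($ret_fail ? "yes" : "no");
        ```

        Assigning the return value to a variable, as in the commented-out "wrong" lines, therefore compares the length of a number or boolean rather than the length of the converted string.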
      > Of course length can also give you different results if you change the UTF8 flag.

      Could you please give an example for this? How do you "change the UTF8 flag"?

      • utf8::encode and utf8::decode change the flag, but also the characters (same for Encode). So, obviously, the length can change.
      • utf8::upgrade and utf8::downgrade change the flag and the internal representation, but keep the length.
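
      A minimal sketch of the difference between the two pairs, using "\x{e4}" (ä) as an arbitrary sample character; the variable names are only for illustration:

      ```perl
      use strict;
      use warnings;

      my $orig = "\x{e4}";    # one character

      # upgrade: new internal representation, same characters
      my $upgraded = $orig;
      utf8::upgrade($upgraded);

      # encode: replaces the characters by their UTF-8 octets
      my $encoded = $orig;
      utf8::encode($encoded);

      # decode: turns the octets back into the original character
      my $decoded = $encoded;
      utf8::decode($decoded);

      printf "upgraded: length %d, equal to original: %s\n",
          length($upgraded), ($upgraded eq $orig ? "yes" : "no");
      printf "encoded:  length %d, equal to original: %s\n",
          length($encoded), ($encoded eq $orig ? "yes" : "no");
      printf "decoded:  length %d, equal to original: %s\n",
          length($decoded), ($decoded eq $orig ? "yes" : "no");
      ```

      The encoded copy is a different string (the two characters "\x{c3}\x{a4}"), so a different length is expected there and says nothing about the flag itself.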
        > Could you please give an example for this? How do you "change the UTF8 flag"?

        use v5.12.0;
        use warnings;
        #use Devel::Peek;
        use utf8;
        use Encode qw(is_utf8 _utf8_on _utf8_off);

        my $str = "ä";
        say $str, ":", length($str);
        #Dump($str);

        _utf8_off($str);
        say $str, ":", length($str);
        #Dump($str);

        ä:1
        ä:2

        Cheers Rolf
        (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)