in reply to Seeking Perl docs about how UTF8 flag propagates

What have I overlooked?

The principle is that you don't need to know: it's an internal flag, and you just need to trust that strings will behave as they should. The intent is that when the UTF8 flag is turned off, this is purely an optimization that allows the internals to do various things in a simpler, faster way. So that's why the docs aren't littered with discussions of what effect each operation has on the flag.

Do you have a specific reason for wanting to know the state of the flag in particular cases? I'm sure we can help answer questions about specifics.

Re^2: Seeking Perl docs about how UTF8 flag propagates
by raygun (Scribe) on May 15, 2023 at 19:37 UTC
    The intent is that when the UTF8 flag is turned off, this is purely an optimization that allows the internals to do various things in a simpler, faster way.
    If that's the intent, it doesn't always work in practice, when certain functions (e.g., lc) change their behavior depending on how the flag is set. But I take your point that because this is the intent (whether it works that way in practice or not), the info I'm seeking is undocumented. So I have the answer I need, thank you.

      certain functions (e.g., lc) change their behavior depending on how the flag is set.

      Yes, this is indeed the fly in the ointment. As far as I know, such cases are documented, and in this case at least the documentation describes mechanisms that force it to behave one way or the other, independent of the UTF8 flag (e.g. use bytes versus use feature 'unicode_strings').

      Those aspects of Perl that require you to know the state of the UTF8 flag are collectively known as "the Unicode bug", and there is more detail in a section devoted to this in perlunicode.

        Those aspects of Perl that require you to know the state of the UTF8 flag
        But I'm not asking "what is the state of the UTF8 flag." I'm asking, "Does a given operation preserve the state?"
        $str2 = $str;
        $str3 = sprintf("str is %s", $str);
        @words = split(/ /, $str);
        Do $str2, $str3, and $words[2] have the same flag value as $str? Does it depend on other factors? Is it undefined? (I suppose since it's intentionally undocumented, it's at least theoretically undefined.)
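One way to look at this empirically, without treating the result as any kind of guarantee, is to inspect each result with utf8::is_utf8. This is only a minimal sketch: the sample string and the helper name flag_state are made up for illustration, and whatever it prints describes only the perl it happens to run on.

```perl
use strict;
use warnings;

# Hypothetical helper: report the internal UTF8 flag of a string as text.
sub flag_state { return utf8::is_utf8($_[0]) ? "on" : "off" }

# Sample string (an assumption for this sketch), forced into upgraded storage.
my $str = "one two caf\x{e9}";
utf8::upgrade($str);

# The three operations from the question.
my $str2  = $str;
my $str3  = sprintf("str is %s", $str);
my @words = split(/ /, $str);

# Print the flag state of the source string and of each result.
printf "%-10s flag %s\n", @$_
    for [ '$str',      flag_state($str) ],
        [ '$str2',     flag_state($str2) ],
        [ '$str3',     flag_state($str3) ],
        [ '$words[2]', flag_state($words[2]) ];
```

Replacing utf8::upgrade with utf8::downgrade on the sample string shows the other starting state. Either way, the string values themselves compare equal with eq regardless of the flag.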

      If that's the intent

      It is. Code that behaves differently based on the internal storage format is said to suffer from The Unicode Bug.

      it doesn't always work in practice

      True. Notably, the operators that accept file names. And of course, some XS modules.

      utf8::upgrade and utf8::downgrade can be used to work around these bugs.
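As a rough illustration of that workaround, the sketch below forces one string into each storage format and checks that the string value is unchanged. The variable names are arbitrary, and the example assumes an ASCII-based platform.

```perl
use strict;
use warnings;

my $bytes = "\x{e9}";      # é, stored as a single byte, UTF8 flag off

my $wide = $bytes;
utf8::upgrade($wide);      # same character, stored as UTF-8 internally, flag on

my $back = $wide;
utf8::downgrade($back);    # back to one byte per character, flag off

printf "same value: %s\n", ($bytes eq $wide && $wide eq $back) ? "yes" : "no";
printf "flags: bytes=%d wide=%d back=%d\n",
    map { utf8::is_utf8($_) ? 1 : 0 } $bytes, $wide, $back;
```

Note that utf8::downgrade dies if the string contains a character above 0xFF, unless a true second argument is passed, in which case it returns false instead.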

      certain functions (e.g., lc) change their behavior depending on how the flag is set.

      lc, uc and the regex engine were fixed in 5.14, released in 2011 (12 years ago).

      To get the fix, you need to say use v5.14; or, more specifically, use feature qw( unicode_strings );.

      (The feature actually appeared in 5.12, but it didn't fix as many things in 5.12 as in 5.14, so I pretend it was added in 5.14.)
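A small sketch of the difference the feature makes for lc, assuming a perl of 5.14 or later with no locale in effect. The same character (É, U+00C9) is lowercased once from downgraded storage and once from upgraded storage, with and without unicode_strings.

```perl
use strict;
use warnings;

my $down = "\x{C9}";      # É stored as a single byte, UTF8 flag off
my $up   = $down;
utf8::upgrade($up);       # same character, UTF8 flag on

my (@without, @with);
{
    no feature 'unicode_strings';
    @without = map { lc($_) } ($down, $up);
}
{
    use feature 'unicode_strings';
    @with = map { lc($_) } ($down, $up);
}

# Show the resulting code points for the flag-off and flag-on inputs.
printf "without feature: %s\n",
    join ", ", map { sprintf("U+%04X", ord($_)) } @without;
printf "with feature:    %s\n",
    join ", ", map { sprintf("U+%04X", ord($_)) } @with;
```

Without the feature, the flag-off input should come back unchanged while the flag-on input is lowercased. With the feature, both inputs give the same result.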

      Could you please provide an example where lc behaves differently, depending on the flag?

      As far as I know lc will simply preserve the flag of the input (I am not sure whether this holds on EBCDIC platforms).

      The opposite function, uc, is known to set the flag for a (non-flagged) input of chr 0xFF or 'ÿ' when Unicode rules apply (for example under use feature 'unicode_strings'): its uppercase equivalent 'Ÿ' is not present in ISO-8859-1, but is taken from the Unicode block Latin Extended-A.
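This can be checked with a few lines. The sketch assumes use feature 'unicode_strings' so that Unicode casing rules apply to the one-byte input; the variable names are arbitrary.

```perl
use strict;
use warnings;
use feature 'unicode_strings';

my $in  = chr 0xFF;    # ÿ, a character that fits in one byte
my $out = uc $in;      # Ÿ is U+0178, which cannot be stored in one byte

printf "input  U+%04X, flag %s\n", ord($in),  utf8::is_utf8($in)  ? "on" : "off";
printf "output U+%04X, flag %s\n", ord($out), utf8::is_utf8($out) ? "on" : "off";
```

Since the result is above 0xFF, perl has no choice but to store it in the upgraded format.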

        Not sure about lc(), but here's another case where the closely-related uc() behaves differently:

        $ascii = "\x{df}";
        chop($utfer = "\x{100}");
        $utf = $ascii . $utfer;
        print uc($_) for ($ascii, $utf);

        As a Unicode codepoint, "\x{df}" is interpreted as the lowercase German eszett (ß), which uppercases to "SS". When the string is stored as bytes with the UTF8 flag off (0xDF is outside ASCII, so ASCII-only casing rules leave it alone), it is seen as a non-word character, and does not change.

        This is a rare case where changing the case of a string also changes its length.

      > when certain functions (e.g., lc) change their behavior depending on how the flag is set

      That's the point you seem to be missing.

      The function length must report a different number of characters if 2-4 bytes are supposed to represent a single Unicode character because of the UTF8 flag. The same goes for other functions.

      Otherwise please be more specific about what lc does wrongly...

      Cheers Rolf
      (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)

        The function length() always reports the number of characters in the string: you do not need to know whether the UTF8 flag is set on your string to understand what it will do.
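To make that concrete, here is a short sketch that measures the same five-character string in both storage formats. The use bytes block is there only to peek at the internal byte count (that pragma is not something to rely on in real code), and the numbers in the comments assume an ASCII-based platform.

```perl
use strict;
use warnings;

my $down = "na\x{ef}ve";   # "naïve", one byte per character, UTF8 flag off
my $up   = $down;
utf8::upgrade($up);        # same characters, UTF-8 encoded internally, flag on

my $chars_down = length $down;   # 5
my $chars_up   = length $up;     # 5

my ($bytes_down, $bytes_up);
{
    use bytes;                       # length now counts internal bytes
    $bytes_down = length $down;      # 5
    $bytes_up   = length $up;        # 6, because ï takes two bytes in UTF-8
}

print "characters: $chars_down and $chars_up\n";
print "internal bytes: $bytes_down and $bytes_up\n";
```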

        The function lc(), on the other hand, will give different results for the same string (i.e. a string consisting of the same characters) depending on whether the UTF8 flag is set or not. As such it is an example of the Unicode bug in action.

        This is not necessarily wrong - it is after all documented behaviour. But it does mean that, despite the intent, the programmer needs to know how the UTF8 flag will have been set to correctly predict the behaviour of lc() on strings containing certain characters.

        Otherwise please be more specific about what lc does wrongly...
        As far as I know it does nothing wrongly. It just does things differently depending on previous operations whose effects are not fully documented.