in reply to Re^7: Seeking Perl docs about how UTF8 flag propagates
in thread Seeking Perl docs about how UTF8 flag propagates

These days I'm working with binary data in Perl, so I'm a bit skeptical about such claims. Consider:

my $upper_a = "A";

In $upper_a, the UTF8 flag is off. All of Perl's character functions work on this variable, and not only for backwards compatibility. I'd rather call it a text string than a binary string.

If I create the character with $upper_a = "\N{U+41}", then the result has the UTF8 flag on. It behaves no differently from the $upper_a created before.
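A minimal sketch of that comparison, using utf8::is_utf8 to peek at the flag (the variable names $plain and $flagged are just for illustration):

```perl
use strict;
use warnings;

my $plain   = "A";          # ordinary literal: UTF8 flag off
my $flagged = "\N{U+41}";   # same character via \N{U+...}: UTF8 flag on

my $plain_flag   = utf8::is_utf8($plain)   ? 1 : 0;
my $flagged_flag = utf8::is_utf8($flagged) ? 1 : 0;
print "plain: flag=$plain_flag, flagged: flag=$flagged_flag\n";

# Despite the different flag, the two compare equal and behave alike.
my $same_string = $plain eq $flagged                   ? 1 : 0;
my $same_length = length($plain) == length($flagged)   ? 1 : 0;
my $both_word   = ($plain =~ /\w/ && $flagged =~ /\w/) ? 1 : 0;
print "eq=$same_string length=$same_length \\w=$both_word\n";
```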

Binary strings in Perl need careful treatment. Strings where the UTF8 flag is off qualify as binary strings, but they can also be used as character strings. Applying character functions to a binary string might switch the UTF8 flag on. The flag is contagious: a single term with the flag on (even if it holds only a character from the ASCII set) is enough for the result to carry the flag. Whether that's a problem depends on whether each position (to avoid the terms "byte" and "character" for the moment) holds a value of less than 256.
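To illustrate the contagion, here is a small sketch with made-up data. It concatenates an unflagged byte string with a flagged ASCII character, then checks with utf8::downgrade whether the result can still be treated as bytes:

```perl
use strict;
use warnings;

my $bytes   = "\x01\x02\xFF";   # three byte values, UTF8 flag off
my $flagged = "\N{U+41}";       # ASCII "A", but UTF8 flag on

# Concatenating with a single flagged term upgrades the whole result.
my $joined      = $bytes . $flagged;
my $joined_flag = utf8::is_utf8($joined) ? 1 : 0;
printf "joined: flag=%d length=%d\n", $joined_flag, length($joined);

# Every position is still below 256, so the string can be downgraded
# again and used as bytes. The second argument (FAIL_OK) makes
# utf8::downgrade return false instead of dying when it cannot.
my $harmless    = $joined;
my $harmless_ok = utf8::downgrade($harmless, 1) ? 1 : 0;

# A position holding 256 or more cannot be stored as a single byte.
my $wide    = $bytes . "\x{263A}";
my $wide_ok = utf8::downgrade($wide, 1) ? 1 : 0;

print "harmless downgrades: $harmless_ok, wide downgrades: $wide_ok\n";
```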

If I create an 'ä' as \N{U+E4}, then the UTF8 flag is on. Perl's internal PV is "\303\244" (0xC3A4). But still, it can be treated like a binary 0xE4: It matches /\xE4/, and it can be printed without a "wide character" warning as a single byte.
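A sketch of those observations. I use utf8::encode on a copy to show the UTF-8 bytes (which for this code point are the same as the internal buffer), and an in-memory filehandle as a stand-in for STDOUT so the printed bytes can be inspected:

```perl
use strict;
use warnings;

my $ae = "\N{U+E4}";                         # 'ä', UTF8 flag on
my $ae_flag = utf8::is_utf8($ae) ? 1 : 0;

# The UTF-8 byte sequence behind the single character.
my $utf8_bytes = $ae;
utf8::encode($utf8_bytes);                   # now "\303\244", two bytes

# Still one position with value 0xE4 as far as Perl code is concerned.
my $matches = $ae =~ /\xE4/ ? 1 : 0;

# Printing to a handle without an encoding layer emits one byte,
# and no "Wide character" warning, because 0xE4 fits in a byte.
my $buf = '';
open my $fh, '>', \$buf or die "cannot open in-memory handle: $!";
print {$fh} $ae;
close $fh;

printf "flag=%d length=%d utf8_bytes=%d matches=%d printed=%d byte(s)\n",
    $ae_flag, length($ae), length($utf8_bytes), $matches, length($buf);
```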

Re^9: Seeking Perl docs about how UTF8 flag propagates (Terminology)
by LanX (Saint) on May 17, 2023 at 17:15 UTC
    One major problem, as usual, is that the various perldocs were written by different authors, each using a different set of terminology.
    • binary, byte, octet, etc for strings without flag
    • character, text, etc for strings with flag
    • sometimes other words like "stream" are used to abstract "string".
    • the flag is named utf8-flag even if the internal format isn't 100% UTF8 and we still need to decode utf8 encoded strings.
    • (update: mind also UTF-8 vs. utf8 vs. UTF8)
    Furthermore, Perl is predominantly operator-dominated, in contrast to the type domination in other languages.

    We can use a bitwise operation on strings, and the so-called "utf8-flag" shouldn't matter at all, because the text interpretation of the data is irrelevant.
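    A small sketch of what I mean, assuming all values stay below 256 and the "bitwise" feature is not enabled (so & on two strings is the string operator). The same data stored downgraded and upgraded gives the same result:

```perl
use strict;
use warnings;

my $mask = "\xDF";            # clears bit 0x20

my $down = "\xE4";            # stored downgraded, flag off
my $up   = "\xE4";
utf8::upgrade($up);           # same value, stored upgraded, flag on

# String bitwise AND works on the values, not on the storage format.
my $r_down = $down & $mask;
my $r_up   = $up   & $mask;

printf "downgraded: %02X, upgraded: %02X\n", ord($r_down), ord($r_up);
```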

    Much of the confusion in this thread is due to terminology.

    And when I read canonical sources like the following it's messy:

    Those perldocs are only the tip of the iceberg, because various other operator-specific docs build on their terminology (or not).

    I wonder if the authors would be able to talk to each other.

    Cheers Rolf
    (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
    Wikisyntax for the Monastery

      Indeed!

      As you know, I've written a fair bit about this. I use these terms to be as unambiguous as possible:


      Terms describing what a string represents (unrelated to storage format):

      • String of decoded text aka string of Unicode Code Points
      • String of bytes, such as a string of encoded text

      For example, /\w/ only works when applied to a string of decoded text.

      For example, pack returns a string of bytes.
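      A minimal sketch of the two kinds of string, using Encode's decode and encode. The sample value 'ä' is just an illustration:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# String of bytes: the UTF-8 encoding of 'ä' (encoded text).
my $encoded = "\xC3\xA4";

# pack also returns a string of bytes; here the same two bytes.
my $packed = pack('n', 0xC3A4);

# String of decoded text: one Unicode Code Point, U+00E4.
my $decoded = decode('UTF-8', $encoded);

# \w is meaningful on the decoded text.
my $is_word = $decoded =~ /^\w$/ ? 1 : 0;

# Encoding takes us back to the string of bytes.
my $round_trip = encode('UTF-8', $decoded);

printf "encoded length=%d, decoded length=%d, \\w=%d\n",
    length($encoded), length($decoded), $is_word;
```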

      "String of bytes" can still confuse people, so I try to avoid that one by being more specific (e.g. by using "encoded text" instead of "bytes").


      Terms describing the internal storage format (unrelated to semantics):

      • Downgraded string
      • Upgraded string
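      A short sketch of the storage side, using utf8::upgrade and utf8::downgrade. The sample string is arbitrary; the point is that the string's value does not change:

```perl
use strict;
use warnings;

my $original = "caf\xE9";          # downgraded: UTF8 flag off

my $copy = $original;
utf8::upgrade($copy);              # upgraded: UTF8 flag on
my $flag_after_upgrade = utf8::is_utf8($copy) ? 1 : 0;
my $equal_upgraded     = $copy eq $original   ? 1 : 0;

utf8::downgrade($copy);            # downgraded again: flag off
my $flag_after_downgrade = utf8::is_utf8($copy) ? 1 : 0;
my $equal_downgraded     = $copy eq $original   ? 1 : 0;

print "upgraded: flag=$flag_after_upgrade equal=$equal_upgraded\n";
print "downgraded: flag=$flag_after_downgrade equal=$equal_downgraded\n";
```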
        Regarding wording and reasoning, Encode seems to be the most consistent of all the docs I've skimmed through so far.

        It starts by clarifying its terminology and seems to stay true to it.

        It would be nice to find a way to harmonize the other docs based on this one, or at least to take Encode's style as a starting point.

        Cheers Rolf

        > For example, \w only works when applied to a string of decoded text.

        I think it's better to say that \w defaults to ASCII.

        So if the encoded text is ASCII it'll "work".

        If it's Latin-1, \w won't match the extra alphanumerics.
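        A quick sketch of that behaviour. It assumes default rules, meaning no "use locale", no "unicode_strings" feature and no "use v5.12" or later in scope, and a perl recent enough (5.14+) for the /u and /a modifiers:

```perl
use strict;
use warnings;

my $ascii  = "a";
my $latin1 = "\xE4";     # 'ä' as a Latin-1 byte, UTF8 flag off

# Default rules: only ASCII word characters match in an unflagged string.
my $ascii_default  = $ascii  =~ /\w/  ? 1 : 0;
my $latin1_default = $latin1 =~ /\w/  ? 1 : 0;

# /u forces Unicode rules, /a restricts \w to ASCII.
my $latin1_u = $latin1 =~ /\w/u ? 1 : 0;
my $latin1_a = $latin1 =~ /\w/a ? 1 : 0;

# The same value stored upgraded gets Unicode rules by default.
my $upgraded = $latin1;
utf8::upgrade($upgraded);
my $upgraded_default = $upgraded =~ /\w/ ? 1 : 0;

print "ascii=$ascii_default latin1=$latin1_default ",
      "/u=$latin1_u /a=$latin1_a upgraded=$upgraded_default\n";
```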

        Cheers Rolf