PerlMonks  

Re^9: Seeking Perl docs about how UTF8 flag propagates (Terminology)

by LanX (Saint)
on May 17, 2023 at 17:15 UTC ( [id://11152261] )


in reply to Re^8: Seeking Perl docs about how UTF8 flag propagates
in thread Seeking Perl docs about how UTF8 flag propagates

One major problem is, as usual, that the various perldocs were written by different authors, each using a different set of terminology:
  • binary, byte, octet, etc for strings without flag
  • character, text, etc for strings with flag
  • sometimes other words like "stream" are used to abstract "string".
  • the flag is named utf8-flag even though the internal format isn't 100% UTF8, and we still need to decode utf8 encoded strings.
  • (update: mind also UTF-8 vs. utf8 vs. UTF8)
Furthermore, Perl is predominantly operator-driven, in contrast to the type-driven semantics of other languages.

We can apply a bitwise operation to strings, and the so-called "utf8-flag" shouldn't matter at all, because the text interpretation of the data is irrelevant.
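To make that concrete, here is a minimal sketch (variable names are purely illustrative) that XORs two strings with identical content but different internal storage. It assumes all characters are below 0x100, since bitwise string operators are not defined for higher code points.

```perl
use strict;
use warnings;

# Two strings with identical content but different internal storage.
my $downgraded = "AB";
my $upgraded   = "AB";
utf8::upgrade($upgraded);    # sets the utf8 flag; the content is unchanged

# XOR every character with a space (0x20), which flips ASCII case.
my $mask = "  ";
my $r1   = $downgraded ^ $mask;
my $r2   = $upgraded ^ $mask;

printf "utf8 flag: %d vs %d\n",
    utf8::is_utf8($downgraded) ? 1 : 0,
    utf8::is_utf8($upgraded)   ? 1 : 0;
print "same result: $r1\n" if $r1 eq $r2;    # both are "ab"
```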

Much of the confusion in this thread is due to terminology.

And when I read canonical sources like the following it's messy:

Those perldocs are only the tip of the iceberg, because various other operator-specific docs build on their terminology (or not).

I wonder if the authors would be able to talk to each other.

Cheers Rolf
(addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
Wikisyntax for the Monastery

Replies are listed 'Best First'.
Re^10: Seeking Perl docs about how UTF8 flag propagates (Terminology)
by ikegami (Patriarch) on May 17, 2023 at 18:41 UTC

    Indeed!

    As you know, I've written a fair bit about this. I use these terms to be as unambiguous as possible:


    Terms describing what a string represents (unrelated to storage format):

    • String of decoded text aka string of Unicode Code Points
    • String of bytes, such as a string of encoded text

    For example, /\w/ only works when applied to a string of decoded text.

    For example, pack returns a string of bytes.

    "String of bytes" can still confuse people, so I try to avoid that one by being more specific (e.g. by using "encoded text" instead of "bytes").
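A small sketch of the distinction, assuming the bytes happen to be UTF-8 encoded text (the variable names are illustrative only): pack produces a string of bytes, and decode turns it into a string of Unicode Code Points that /\w/ can sensibly be applied to.

```perl
use strict;
use warnings;
use Encode qw( decode );

# pack returns a string of bytes; these two are the UTF-8 encoding of U+00E9.
my $bytes = pack( "C*", 0xC3, 0xA9 );

# Decoding turns the encoded text into a string of Unicode Code Points.
my $text = decode( "UTF-8", $bytes );

printf "bytes: length %d\n", length($bytes);    # one element per byte
printf "text:  length %d, U+%04X\n", length($text), ord($text);

# /\w/ is only meaningful against the decoded form.
print "decoded text matches /\\w/\n" if $text =~ /\w/;
```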


    Terms describing the internal storage format (unrelated to semantics):

    • Downgraded string
    • Upgraded string
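As a rough illustration of the two storage formats, the sketch below switches one string back and forth with utf8::upgrade and utf8::downgrade and checks that its value never changes. The sample string is an assumption chosen so that every character is below 0x100, which is what makes the downgrade possible.

```perl
use strict;
use warnings;

my $s    = "caf\xE9";    # four characters, all below 0x100
my $copy = $s;

utf8::upgrade($copy);    # switch to the upgraded storage format
print "upgraded\n"    if utf8::is_utf8($copy);
print "same string\n" if $copy eq $s;    # the value is unaffected

utf8::downgrade($copy);  # switch back; possible because every char < 0x100
print "downgraded\n" if !utf8::is_utf8($copy);
printf "length is %d in both formats\n", length($copy);
```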
      Regarding wording and reasoning, Encode seems to be the most consistent of all the docs I've skimmed through so far.

      It starts by clarifying its Terminology and seems to stay true to it.

      It would be nice to find a way to harmonize the other docs based on this one, at least taking Encode's style as a starting point.

      Cheers Rolf
      (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
      Wikisyntax for the Monastery

        Really?

        Character's definition is wrong. Perl has 32 or 64 bit chars, not 32 bit chars. And these days, it's usually 64 bit chars. Theoretically, the encoding supports 72 bit chars, but they gotta fit in a UV for Perl to be able to work with them, so the size of a UV controls the range of a char.

        It's also not that clear. The important part is that a character is an element of a string. For example, substr( $_, $i, 1 ) returns the character at offset $i. I would also mention the range, but only after saying it's the element of a string. We could also mention chr and ord as means of switching between representations of that character. But, while I disagree with the wording, I do agree with the term and its meaning.

        Then there's byte and octet, and I can't tell how they are different from each other from the Terminology section. And anyone who uses these two words to mean two different things needs to find better terms.
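A short sketch of "a character is an element of a string", using an arbitrary sample string (an assumption for illustration): substr picks out each element, ord gives its number, and chr goes the other way. The last line reports the UV size of the running perl, which bounds the range of a character.

```perl
use strict;
use warnings;
use Config;

# Four characters: ASCII, Latin-1 range, and two beyond 0xFF.
my $s = "a\x{E9}\x{100}\x{1F600}";

for my $i ( 0 .. length($s) - 1 ) {
    my $ch = substr( $s, $i, 1 );    # the character at offset $i
    printf "offset %d: U+%04X\n", $i, ord($ch);
}

# chr and ord convert between a character and its number.
print "round trip ok\n" if chr( ord( substr( $s, 3, 1 ) ) ) eq "\x{1F600}";

printf "UV is %d bits wide on this build\n", 8 * $Config{uvsize};
```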

        And that's it? Where's the rest? The phrases we actually need?

        If we continue on to the next section, it starts by saying that encode "Encodes the scalar value STRING from Perl's internal form into ENCODING and returns a sequence of octets." wtf does the internal form have to do with anything? The input is expected to be a string of Unicode Code Points. The internal form of those UCP is not relevant.
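One way to check that the internal form is irrelevant to encode: feed it the same string once downgraded and once upgraded and compare the output. This is only a sketch with an assumed sample string.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Same string of Unicode Code Points, stored in two internal forms.
my $downgraded = "caf\xE9";
my $upgraded   = $downgraded;
utf8::upgrade($upgraded);

my $enc1 = encode( "UTF-8", $downgraded );
my $enc2 = encode( "UTF-8", $upgraded );

print "identical output\n" if $enc1 eq $enc2;

# Show the resulting bytes in hex.
print join( " ", map { sprintf "%02X", ord } split //, $enc1 ), "\n";
```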

        Four sentences, and five problems. And that's not counting all the missing terminology. This is not what I was expecting when you gave it a gold star.

      > For example, \w only works when applied to a string of decoded text.

      I think it's better to say that \w defaults to ASCII.

      So if the encoded text is ASCII it'll "work".

      If it's Latin-1, \w won't match the extra alphanumerics.

      Cheers Rolf
      (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
      Wikisyntax for the Monastery

        That's wrong.

        It doesn't "default to ASCII". It works against decoded text aka string of Unicode Code Points. Always. This can be demonstrated using "\N{U+100}" =~ /\w/ (which matches). You need to use /a to limit it to the ASCII range.
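The demonstration spelled out as a runnable sketch (it needs Perl 5.14 or later for the /a modifier):

```perl
use strict;
use warnings;

my $ch = "\N{U+100}";    # LATIN CAPITAL LETTER A WITH MACRON

print "U+0100 matches /\\w/\n"         if $ch =~ /\w/;
print "U+0100 does not match /\\w/a\n" if $ch !~ /\w/a;
print "ASCII 'A' matches /\\w/a\n"     if "A" =~ /\w/a;
```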

        Text encoded using ASCII happens to work because $x eq encode( "US-ASCII", $x ).

        Text encoded using iso-latin-1 happens to work because $x eq encode( "iso-latin-1", $x ) (though do see last paragraph).

        Those are just side effects of \w working on decoded text.
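A sketch of those two identities, with assumed sample strings. It uses the name "iso-8859-1" for the Latin-1 encoding, and adds UTF-8 as a contrast where the encoded form no longer equals the decoded text.

```perl
use strict;
use warnings;
use Encode qw( encode );

my $ascii  = "cafe";       # every character below 0x80
my $latin1 = "caf\xE9";    # every character below 0x100

# For these encodings, each character maps to the byte with the same number.
print "ASCII: encoded eq decoded\n"
    if $ascii eq encode( "US-ASCII", $ascii );
print "Latin-1: encoded eq decoded\n"
    if $latin1 eq encode( "iso-8859-1", $latin1 );

# That is not true for UTF-8 once a character is above 0x7F.
print "UTF-8: encoded ne decoded\n"
    if $latin1 ne encode( "UTF-8", $latin1 );
```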

        There was a bug where \w didn't work for characters in U+0080..U+00FF sometimes. This was fixed 12 years ago in 2011. Add use v5.14; to get the fix.
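The old behaviour can still be seen by comparing a scope without the unicode_strings feature to one with use v5.14; in effect. This is a sketch assuming an ASCII platform and no locale in effect; the variable names are illustrative.

```perl
use strict;
use warnings;

# U+00E9 stored in both internal formats.
my $downgraded = "\xE9";
my $upgraded   = $downgraded;
utf8::upgrade($upgraded);

my ( $old_down, $old_up, $new_down, $new_up );
{
    no feature 'unicode_strings';    # legacy rules: result depends on storage
    $old_down = $downgraded =~ /\w/ ? 1 : 0;
    $old_up   = $upgraded   =~ /\w/ ? 1 : 0;
}
{
    use v5.14;                       # enables the unicode_strings feature
    $new_down = $downgraded =~ /\w/ ? 1 : 0;
    $new_up   = $upgraded   =~ /\w/ ? 1 : 0;
}

print "without the fix: downgraded=$old_down upgraded=$old_up\n";
print "with use v5.14:  downgraded=$new_down upgraded=$new_up\n";
```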
