PerlMonks  

Re^13: Seeking Perl docs about how UTF8 flag propagates (Terminology)

by LanX (Saint)
on May 18, 2023 at 22:11 UTC ( [id://11152290]=note )


in reply to Re^12: Seeking Perl docs about how UTF8 flag propagates (Terminology)
in thread Seeking Perl docs about how UTF8 flag propagates

> This is not what I was expecting when you gave it a gold star.

Well your stars are more golden than mine.

Seriously, it says 32 bits or more° and I don't care that much, as long as Perl and Unicode aren't expanded to cover all scripts of the galaxy.

And I cringe about calling a byte a character. A string - as a sequence of bytes² - can hold any kind of packed data which fits into memory, e.g. a JPG. Perl also has plenty of string operators which don't assume text.

BUT ... "from all docs I skimmed thru yet" ... it's the best in having an axiomatic build up with clarifying the terminology first.

And as I said "taking the style as starting point."

Cheers Rolf
(addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
Wikisyntax for the Monastery

°) "...range 0 .. 2**32-1 (or more)"

²) The idea seems to be to define a "logical" character as the sub-unit of a string, as returned by split // or counted by length. That's unfortunate wording IMHO.
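
To make the footnote concrete, here is a minimal sketch of how split // and length count string elements. It assumes the core Encode module; the sample value (the UTF-8 encoding of U+00E9) is only an illustration and does not come from the docs under discussion.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Illustrative input: the two octets that encode U+00E9 in UTF-8.
my $octets  = "\xC3\xA9";
# Decoding yields a string with a single element whose value is 0xE9.
my $decoded = decode('UTF-8', $octets);

# split // returns one list item per string element.
my @octet_elems   = split //, $octets;
my @decoded_elems = split //, $decoded;

printf "octets:  length=%d, split gives %d elements\n",
    length($octets), scalar @octet_elems;
printf "decoded: length=%d, split gives %d elements\n",
    length($decoded), scalar @decoded_elems;
```

In both cases length and split // agree with each other; what differs is whether the elements stand for octets or for decoded code points.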


Replies are listed 'Best First'.
Re^14: Seeking Perl docs about how UTF8 flag propagates (Terminology)
by ikegami (Patriarch) on May 19, 2023 at 15:20 UTC

    > Well your stars are more golden than mine.

    I meant that you were praising something filled with problems.

    > And I cringe about calling a byte a character.

    I'm not sure what you mean?

    Maybe you dislike the fact that characters are elements of strings. But the Terminology you linked is consistent with that. And it's accurate. A character has always been an element of a string, no matter what it represents. "Character" doesn't imply semantics.

    • substr: "First character is at offset zero."
    • chop: "Chops off the last character of a string and returns the character chopped."
    • ord: "Returns the numeric value of the first character of EXPR."
    • regex dot: "Match any single character [...]"
    • etc.
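
A minimal sketch of the point behind the list above: substr, chop, ord and the regex dot all operate on string elements, whatever those elements represent. The two sample strings (four packed byte values and four Gothic letters) are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Two strings of four elements each: one holds packed binary data,
# the other holds text. Both values are illustrative.
my $packed = pack("C*", 0xFF, 0xD8, 0xFF, 0xE0);
my $text   = "\x{10340}\x{10334}\x{10342}\x{1033B}";

my @report;
for my $s ($packed, $text) {
    my $copy    = $s;                  # chop modifies its argument, so work on a copy
    my $matches = () = $s =~ /./sg;    # number of single-element matches of the dot
    push @report, {
        length  => length($s),
        first   => ord($s),                 # value of the first element
        second  => ord(substr($s, 1, 1)),   # element at offset 1
        last    => ord(chop($copy)),        # element removed by chop
        matches => $matches,
    };
}

printf "length=%d first=0x%X second=0x%X last=0x%X dot-matches=%d\n",
    @{$_}{qw(length first second last matches)} for @report;
```

Both strings report four elements and four dot matches; only the element values differ.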

    This is not just a Perl thing either.

    > it's the best in having an axiomatic build up with clarifying the terminology first.

    Yes, but that's not what we were discussing. We were discussing the quality of the terminology in question.

      > > And I cringe about calling a byte a character.

      > I'm not sure what you mean?

      A byte is commonly defined as 8 bits.

      Encode says

      • byte

        A character in the range 0..255

      While most other sources - like WP - say "UTF-8 is encoding characters using one to four one-byte units".

      This will make readers stumble into paradoxical mental loops. ( -> a character is encoded by 1-4 characters ... WTF? )

      As you pointed out, most (not all) string operators in Perl are "character" based. (Those should maybe be better called "text operators")

      What comprises a "character" in a Perl string depends on the "UTF8 flag":

      • without it, it's one byte
      • with it, it's one to four bytes.

      If a terminology invites misunderstandings, one should choose a new word.

      Cheers Rolf
      (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
      Wikisyntax for the Monastery

        Something to keep in mind is that the fact that a term can mean two different things doesn't necessarily mean it's a confusing term. If only one of the definitions is used in a given context, then there's no confusion. For example, it would be perfectly fine to say that pack( "N", 200 ) returns a string of 4 bytes. It doesn't matter how many bytes of storage are used for it. It's still the same string of 4 bytes if you upgrade the resulting scalar, because we're not talking about how it's stored. We're talking about how it would be stored in a stream. So yeah, the string of 4 bytes might use 5 bytes of storage in the string buffer. And the same string of 4 bytes takes up 42 bytes of memory (according to Devel::Size). Yet none of that is confusing.
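
The pack example above can be sketched as follows. This is a hedged illustration using only core facilities (the utf8:: functions and bytes::length); the variable names are made up, and the Devel::Size figure is not reproduced because it depends on the perl build.

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length without enabling byte semantics

my $s = pack("N", 200);    # four elements: 0x00 0x00 0x00 0xC8
my $u = $s;
utf8::upgrade($u);         # change the internal storage format, not the string

my $len_s    = length($s);          # number of string elements
my $len_u    = length($u);          # still the same number of elements
my $same     = ($s eq $u) ? 1 : 0;  # the two strings compare equal
my $flag_s   = utf8::is_utf8($s) ? 1 : 0;
my $flag_u   = utf8::is_utf8($u) ? 1 : 0;
my $buffer_u = bytes::length($u);   # bytes used in the upgraded string buffer

print "elements: $len_s vs $len_u, equal: $same\n";
print "UTF8 flag: $flag_s vs $flag_u, upgraded buffer: $buffer_u bytes\n";
```

The element 0xC8 needs two bytes in the upgraded buffer, which is where the fifth byte of storage comes from, while length and eq see no difference between the two scalars.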


        > a character is encoded by 1-4 characters ... WTF?

        Would you say "I put the vehicle on the vehicle", or would you say "I drove my car onto the ferry"?

        The fact that the process involves string operations and thus characters isn't relevant. You always have the option of being more specific (e.g. using Code Point instead of character, or byte instead of character) if it makes things clearer.

        That's why one would say a Code Point encodes to one to four bytes.
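
A minimal sketch of that statement, assuming the core Encode module. The four code points are arbitrary picks, one for each encoded length.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Illustrative code points, chosen so that each needs a different number of bytes.
my @code_points = (0x41, 0xE9, 0x20AC, 0x10340);

my %bytes_for;
for my $cp (@code_points) {
    my $char  = chr($cp);                # a string of exactly one element
    my $bytes = encode('UTF-8', $char);  # its UTF-8 encoding as a string of bytes
    $bytes_for{$cp} = length($bytes);
    printf "U+%04X: 1 Code Point -> %d byte(s)\n", $cp, $bytes_for{$cp};
}
```

Here the input is described in terms of Code Points and the output in terms of bytes, so the "character encoded by characters" loop never comes up.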

        > with it, it's one to four bytes.

        Each character takes up 1 to 13 bytes of storage, actually.

        > As you pointed out, most (not all) string operators in Perl are "character" based.

        I did not say that. Quite the opposite. All string operations deal with characters. By definition. A string is made up of characters. I literally said that's the name of the elements of a string.

        > If a terminology invites misunderstandings, one should choose a new word.

        I welcome an unambiguous word for "string element". But until one comes around, I shall continue defining the terminology I use, and part of that is defining a character to be a string element. Cause no one wants me to say string elements.

        But did you notice that the terminology I said I use didn't mention "character" at all? I think you're trying to convince me that "character" is confusing, yet I didn't even use the word! It's a term that can usually be avoided entirely. It only comes into play when dealing with strings of arbitrary content. There's usually no reason for a term for string elements otherwise.
