
Re^12: Seeking Perl docs about how UTF8 flag propagates (Terminology)

by ikegami (Patriarch)
on May 18, 2023 at 16:42 UTC


in reply to Re^11: Seeking Perl docs about how UTF8 flag propagates (Terminology)
in thread Seeking Perl docs about how UTF8 flag propagates

Really?

Character's definition is wrong. Perl has 32 or 64 bit chars, not 32 bit chars. And these days, it's usually 64 bit chars. Theoretically, the encoding supports 72 bit chars, but they gotta fit in a UV for Perl to be able to work with them, so the size of a UV controls the range of a char.
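A minimal sketch of that point, assuming a reasonably modern perl: `$Config{uvsize}` reports the UV size in bytes, and a code point far beyond Unicode's U+10FFFF is still a single character as long as it fits. The value 0x7FFF_FFFF is picked only for illustration because it fits on both 32 bit and 64 bit builds.

```perl
use strict;
use warnings;
no warnings 'utf8';    # the code point below is far outside Unicode
use Config;

# Width of a UV on this build; this bounds the range of a character.
my $uv_bits = 8 * $Config{uvsize};

# One very large character (illustrative value, not a Unicode code point).
my $big = chr(0x7FFF_FFFF);

printf "UV width: %d bits\n", $uv_bits;
printf "length: %d, ord: 0x%X\n", length($big), ord($big);
```

On a 64 bit build, larger values work the same way until they no longer fit.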

It's also not that clear. The important part is that a character is an element of a string. For example, substr( $_, $i, 1 ) returns the character at offset $i. I would also mention the range, but only after saying it's the element of a string. We could also mention chr and ord as means of switching between representations of that character. But, while I disagree with the wording, I do agree with the term and its meaning.
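To make "a character is an element of a string" concrete, here is a small sketch. The sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $str = "a\x{E9}\x{263A}";            # three characters: 0x61, 0xE9, 0x263A

my $count  = length($str);               # number of elements in the string
my $second = substr($str, 1, 1);         # the character at offset 1
my $number = ord($second);               # its numeric value
my $same   = chr($number) eq $second;    # chr goes back the other way

printf "length=%d ord=0x%X roundtrip=%s\n",
    $count, $number, $same ? "yes" : "no";
```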

Then there's byte and octet, and I can't tell how they differ from each other from the Terminology section. And anyone who uses these two words to mean two different things needs to find better terms.

And that's it? Where's the rest? The phrases we actually need?

If we continue on to the next section, it starts by saying that encode "Encodes the scalar value STRING from Perl's internal form into ENCODING and returns a sequence of octets." wtf does the internal form have to do with anything? The input is expected to be a string of Unicode Code Points. The internal form of those UCP is not relevant.
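A quick sketch of why the internal form doesn't matter: two strings holding the same code point, one stored downgraded and one upgraded, compare equal and encode to the same octets. This assumes the core Encode module and the built-in utf8:: functions. The variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $down = "\x{E9}";
my $up   = "\x{E9}";
utf8::downgrade($down);    # internal form: one byte per character
utf8::upgrade($up);        # internal form: UTF-8

# Record the internal flags before doing anything else with the strings.
my $flags_differ = !utf8::is_utf8($down) && utf8::is_utf8($up);

my $same_string = $down eq $up;    # same characters, so equal
my $octets_down = encode('UTF-8', $down);
my $octets_up   = encode('UTF-8', $up);

printf "flags differ: %s, strings equal: %s, octets equal: %s\n",
    map { $_ ? "yes" : "no" }
    $flags_differ, $same_string, $octets_down eq $octets_up;
```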

Four sentences, and five problems. And that's not counting all the missing terminology. This is not what I was expecting when you gave it a gold star.

Replies are listed 'Best First'.
Re^13: Seeking Perl docs about how UTF8 flag propagates (Terminology)
by LanX (Saint) on May 18, 2023 at 22:11 UTC
    > This is not what I was expecting when you gave it a gold star.

    Well your stars are more golden than mine.

    Seriously, it says 32bits or more° and I don't care that much as long as Perl and Unicode aren't expanded to cover all scripts of the galaxy.

    And I cringe about calling a byte a character. A string - as a sequence of bytes² - can hold any kind of packed data which fits into memory, e.g. a JPG. Perl also has plenty of string operators which don't assume text.

    BUT ... "from all docs I skimmed thru yet" ... it's the best in having an axiomatic build up with clarifying the terminology first.

    And as I said "taking the style as starting point."

    Cheers Rolf
    (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
    Wikisyntax for the Monastery

    °) "...range 0 .. 2**32-1 (or more)"

    ²) the idea seems to be to define a "logical" character as the sub-units of strings as returned by split // or 'length'. That's unfortunate wording IMHO.

      > Well your stars are more golden than mine.

      I meant that you were praising something filled with problems.

      > And I cringe about calling a byte a character.

      I'm not sure what you mean?

      Maybe you're disliking the fact that characters are elements of strings. But the Terminology you linked is consistent with that. And it's accurate. A character has always been an element of a string. No matter what it represents. "Character" doesn't imply semantics.

      • substr: "First character is at offset zero."
      • chop: "Chops off the last character of a string and returns the character chopped."
      • ord: "Returns the numeric value of the first character of EXPR."
      • regex dot: "Match any single character [...]"
      • etc
      • etc
      • etc
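      As a sketch of "character doesn't imply semantics", the same operators work on a string of packed binary data. The packed value is arbitrary and chosen only for illustration.

      ```perl
      use strict;
      use warnings;

      # Four elements of binary data, not text.
      my $packed = pack('N', 0xDEADBEEF);

      my $len    = length($packed);                         # counts elements
      my @values = map { ord($_) } $packed =~ /(.)/sg;      # dot matches each one
      my $last   = chop(my $copy = $packed);                # chop returns the last one

      printf "length=%d values=%s last=0x%X\n",
          $len, join(',', map { sprintf '0x%X', $_ } @values), ord($last);
      ```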

      This is not just a Perl thing either.

      > it's the best in having an axiomatic build up with clarifying the terminology first.

      Yes, but that's not what we were discussing. We were discussing the quality of the terminology in question.

        > > And I cringe about calling a byte a character.

        > I'm not sure what you mean?

        A byte is commonly defined as 8 bits.

        Encode says

        • byte

          A character in the range 0..255

        While most other sources - like WP - say "UTF-8 is encoding characters using one to four one-byte units".

        This will make readers stumble into paradoxical mental loops. ( -> a character is encoded by 1-4 characters ... WTF? )

        As you pointed out, most (not all) string operators in Perl are "character" based. (Those should maybe be better called "text operators")

        What comprises a "character" in a Perl-string depends on the "UTF8 flag"

        • Without it, it's one byte.
        • With it, it's one to four bytes.
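        A rough sketch of the one-to-four-bytes point, using Encode to produce the UTF-8 octets rather than peeking at the internal buffer. The sample code points are picked for illustration, one per encoded length.

        ```perl
        use strict;
        use warnings;
        use Encode qw(encode);

        # One sample code point for each UTF-8 encoded length.
        my @code_points = (0x41, 0xE9, 0x263A, 0x1F600);

        # Each chr() is a single character; count the octets it encodes to.
        my @byte_counts = map { length(encode('UTF-8', chr($_))) } @code_points;

        printf "U+%04X -> %d byte(s)\n", $code_points[$_], $byte_counts[$_]
            for 0 .. $#code_points;
        ```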

        If a term invites misunderstandings, one should choose a new word.

        Cheers Rolf
        (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
        Wikisyntax for the Monastery
