Re^10: Seeking Perl docs about how UTF8 flag propagates (Terminology)

Re^11: Seeking Perl docs about how UTF8 flag propagates (Terminology) by LanX (Saint) on May 18, 2023 at 15:34 UTC
Regarding wording and reasoning, `Encode` seems to be the most consistent of all the docs I've skimmed through so far. It starts by clarifying its terminology and seems to stay true to it. It would be nice to find a way to harmonize the other docs based on this one, or at least to take `Encode`'s style as a starting point.

Cheers Rolf (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :) Wikisyntax for the Monastery
Re^12: Seeking Perl docs about how UTF8 flag propagates (Terminology) by ikegami (Patriarch) on May 18, 2023 at 16:42 UTC
Really?

The definition of "character" is wrong. Perl has 32-bit or 64-bit chars, not 32-bit chars, and these days it's usually 64-bit chars. Theoretically, the encoding supports 72-bit chars, but they have to fit in a UV for Perl to be able to work with them, so the size of a UV controls the range of a char.

It's also not that clear. The important part is that a character is an element of a string. For example, `substr( $_, $i, 1 )` returns the character at offset `$i`. I would also mention the range, but only after saying it's the element of a string. We could also mention `chr` and `ord` as means of switching between representations of that character. But, while I disagree with the wording, I do agree with the term and its meaning.

Then there's byte and octet, and I can't tell from the Terminology section how they differ from each other. And anyone who uses these two words to mean two different things needs to find better terms.

And that's it? Where's the rest? The phrases we actually need?

If we continue on to the next section, it starts by saying that `encode` "Encodes the scalar value STRING from Perl's internal form into ENCODING and returns a sequence of octets." WTF does the internal form have to do with anything? The input is expected to be a string of Unicode Code Points. The internal form of those code points is not relevant.

Four sentences, and five problems. And that's not counting all the missing terminology. This is not what I was expecting when you gave it a gold star.
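To make the "a character is an element of a string" reading concrete, here is a minimal sketch. The sample string and the variable names are illustrative assumptions, not taken from the `Encode` docs; only `length`, `substr`, `ord`, `chr`, and `Encode::encode` are used.

```perl
use strict;
use warnings;
use Encode qw( encode );

# A string of three characters (code points U+0061, U+00E9, U+2660).
my $s = "a\x{E9}\x{2660}";

my $len   = length($s);            # 3: length counts characters
my $third = substr($s, 2, 1);      # the character at offset 2
my $code  = ord($third);           # 0x2660: the character as a number
my $same  = chr($code) eq $third;  # true: chr reverses ord

# encode() takes a string of code points and returns a string of octets.
# How Perl stores $s internally does not enter into it.
my $octets = encode("UTF-8", $s);
my $olen   = length($octets);      # 6: 1 + 2 + 3 octets

printf "%d chars, %d octets, third char is U+%04X\n", $len, $olen, $code;
```

Note that `$octets` is itself a string whose elements all lie in 0..255, so the same `length`/`substr`/`ord` vocabulary applies to it.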
Re^13: Seeking Perl docs about how UTF8 flag propagates (Terminology) by LanX (Saint) on May 18, 2023 at 22:11 UTC
> This is not what I was expecting when you gave it a gold star.

Well, your stars are more golden than mine.

Seriously, it says 32 bits or more° and I don't care that much, as long as Perl and Unicode aren't expanded to cover all scripts of the galaxy.

And I cringe at calling a byte a character. A string, as a sequence of bytes², can hold any kind of packed data which fits into memory, e.g. a JPG. Perl also has plenty of string operators which don't assume text.

BUT ... "from all docs I skimmed thru yet" ... it's the best at building things up axiomatically, clarifying the terminology first. And as I said: "taking the style as starting point."

Cheers Rolf (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :) Wikisyntax for the Monastery

°) "...range 0 .. 2**32-1 (or more)"

²) The idea seems to be to define a "logical" character as the sub-unit of a string, as returned by `split //` or counted by `length`. That's unfortunate wording IMHO.
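The footnote's point can be sketched with packed binary data: `length` and `split //` count and return string elements whether or not the string holds text. The `pack` template and values below are arbitrary assumptions chosen for illustration.

```perl
use strict;
use warnings;
use List::Util qw( max );

# Binary data, not text: a 32-bit and a 16-bit big-endian integer plus one byte.
my $packed = pack("N n C", 0xDEADBEEF, 0x1234, 0xFF);

my $len     = length($packed);               # 7 elements (4 + 2 + 1)
my @elems   = split //, $packed;             # the same 7 elements, one per item
my $max_ord = max(map { ord($_) } @elems);   # 255: every element fits in a byte

printf "%d elements, largest value %d\n", $len, $max_ord;
```

Whether one calls those seven elements "characters" or "bytes" is exactly the terminology question being argued in this thread; the operators behave the same either way.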
Re^14: Seeking Perl docs about how UTF8 flag propagates (Terminology) by ikegami (Patriarch) on May 19, 2023 at 15:20 UTC
Re^15: Seeking Perl docs about how UTF8 flag propagates (Terminology) by LanX (Saint) on May 20, 2023 at 11:05 UTC
Re^11: Seeking Perl docs about how UTF8 flag propagates (Terminology) by LanX (Saint) on May 22, 2023 at 14:13 UTC
> For example, \w only works when applied to a string of decoded text.

I think it's better to say that `\w` defaults to ASCII. So if the encoded text is ASCII, it'll "work". If it's Latin-1, `\w` won't match the extra alphanumerics.

Cheers Rolf (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :) Wikisyntax for the Monastery
Re^12: Seeking Perl docs about how UTF8 flag propagates (Terminology) by ikegami (Patriarch) on May 22, 2023 at 16:56 UTC
That's wrong. It doesn't "default to ASCII". It works against decoded text, aka a string of Unicode Code Points. Always. This can be demonstrated using `"\N{U+100}" =~ /\w/` (which matches). You need to use `/a` to limit it to the ASCII range.

Text encoded using ASCII happens to work because `$x eq encode( "US-ASCII", $x )`. Text encoded using iso-latin-1 happens to work because `$x eq encode( "iso-latin-1", $x )` (though do see the last paragraph). Those are just side effects of `\w` working on decoded text.

There was a bug where `\w` sometimes didn't work for characters in U+0080..U+00FF. This was fixed 12 years ago, in 2011. Add `use v5.14;` to get the fix.
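The claims above can be checked with a short script. This is a sketch that assumes Perl 5.14 or later (for the `/a` modifier); the variable names are illustrative, and the encoding is spelled `iso-8859-1` rather than relying on an alias.

```perl
use strict;
use warnings;
use Encode qw( encode );

# \w is defined over code points, not just over ASCII.
my $wide       = "\N{U+100}" =~ /\w/  ? 1 : 0;   # 1: U+0100 is a word character
my $wide_ascii = "\N{U+100}" =~ /\w/a ? 1 : 0;   # 0: /a limits \w to ASCII

# For text in these ranges, the encoded string equals the decoded string,
# which is why matching against the encoded form appears to work.
my $ascii_text  = "cafe";
my $latin1_text = "caf\x{E9}";
my $ascii_same  = $ascii_text  eq encode("US-ASCII",   $ascii_text)  ? 1 : 0;  # 1
my $latin1_same = $latin1_text eq encode("iso-8859-1", $latin1_text) ? 1 : 0;  # 1

print "wide=$wide wide_ascii=$wide_ascii ",
      "ascii_same=$ascii_same latin1_same=$latin1_same\n";
```

The same equality does not hold for UTF-8: `encode("UTF-8", "caf\x{E9}")` is five octets long, so matching `\w` against it is matching against something other than the original text.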
Re^13: Seeking Perl docs about how UTF8 flag propagates (Terminology) by LanX (Saint) on May 22, 2023 at 20:50 UTC
> There was a bug where \w didn't work for characters in U+0080..U+00FF sometimes. This was fixed 12 years ago in 2011. Add use v5.14; to get the fix.

I call anything without pragmas the default. But we can agree that the default is buggy.

Cheers Rolf (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :) Wikisyntax for the Monastery
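The disagreement about "default" can be made visible with explicit regex modifiers. This is a sketch assuming Perl 5.14 or later: `/d` spells out the legacy rules that apply when no pragma is in effect, and `/u` spells out the rules that `use v5.14;` selects through the `unicode_strings` feature. The variable names are illustrative.

```perl
use strict;
use warnings;

my $e = "\x{E9}";   # U+00E9, stored internally as a single byte

# Legacy rules: the result depends on how the string happens to be stored.
my $legacy_bytes = $e =~ /\w/d ? 1 : 0;         # 0: not treated as a word char

my $upgraded = $e;
utf8::upgrade($upgraded);                       # same string, different storage
my $legacy_upgraded = $upgraded =~ /\w/d ? 1 : 0;   # 1: now it matches

# Unicode rules: the storage format no longer matters.
my $unicode_bytes    = $e        =~ /\w/u ? 1 : 0;  # 1
my $unicode_upgraded = $upgraded =~ /\w/u ? 1 : 0;  # 1

print "same string: ", ($e eq $upgraded ? "yes" : "no"), "\n";
print "/d: $legacy_bytes $legacy_upgraded   /u: $unicode_bytes $unicode_upgraded\n";
```

The two strings compare equal with `eq`, yet `/d` gives different answers for them, which is the "sometimes" in the bug description quoted above.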
Re^14: Seeking Perl docs about how UTF8 flag propagates (Terminology) by ikegami (Patriarch) on May 23, 2023 at 01:27 UTC
Re^15: Seeking Perl docs about how UTF8 flag propagates (Terminology) by LanX (Saint) on May 23, 2023 at 10:18 UTC