Re^2: Seeking Perl docs about how UTF8 flag propagates

Re^3: Seeking Perl docs about how UTF8 flag propagates by hv (Prior) on May 15, 2023 at 22:10 UTC
> certain functions (e.g., lc) change their behavior depending on how the flag is set.
Yes, this is indeed the fly in the ointment. As far as I know such cases are documented, and in this case at least the documentation describes mechanisms that force it to behave one way or another independent of the UTF8 flag (e.g. `use bytes` versus `use feature 'unicode_strings'`). Those aspects of Perl that require you to know the state of the UTF8 flag are collectively known as "the Unicode bug", and there is more detail in a section devoted to this in perlunicode.
Re^4: Seeking Perl docs about how UTF8 flag propagates by raygun (Scribe) on May 16, 2023 at 02:17 UTC
> Those aspects of Perl that require you to know the state of the UTF8 flag
But I'm not asking "what is the state of the UTF8 flag." I'm asking, "Does a given operation preserve the state?" `$str2 = $str; $str3 = sprintf("str is %s", $str); @words = split(/ /, $str);` Do `$str2`, `$str3`, and `$words[2]` have the same flag value as `$str`? Does it depend on other factors? Is it undefined? (I suppose since it's intentionally undocumented, it's at least theoretically undefined.)
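One way to probe this empirically is the core function `utf8::is_utf8`. What follows is only a minimal sketch, not documented behaviour: the sample string and the `flag` helper are invented for illustration, and the result may in principle differ between perl versions.

```perl
use strict;
use warnings;

# Hypothetical helper: describe the internal UTF8 flag of a scalar.
sub flag { utf8::is_utf8($_[0]) ? "on" : "off" }

my $str = "one two three";
utf8::upgrade($str);    # same characters, now stored with the UTF8 flag on

my $str2  = $str;
my $str3  = sprintf("str is %s", $str);
my @words = split(/ /, $str);

# Print the flag of the source string and of each derived value.
print "str:      ", flag($str),      "\n";
print "str2:     ", flag($str2),     "\n";
print "str3:     ", flag($str3),     "\n";
print "words[2]: ", flag($words[2]), "\n";
```

Running this on the perl you actually deploy answers the question for that build only; it says nothing about what is guaranteed.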
Re^5: Seeking Perl docs about how UTF8 flag propagates by hv (Prior) on May 16, 2023 at 02:50 UTC
I'd need to (laboriously) check the source for chapter and verse, but as far as I remember, in all the obvious cases when any of the inputs have UTF8 on, the output will too. Here's an example commonly used in perl's tests to create a UTF8-flagged string by appending a flagged zero-length string: `% perl -MDevel::Peek -wle ' $x="\x{100}"; Dump($x); chop $x; Dump($x); $y = "foo"; Dump($y); $y .= $x; Dump($y) ' 2>&1 | grep FLAGS` which prints `FLAGS = (POK,IsCOW,pPOK,UTF8)`, `FLAGS = (POK,pPOK,UTF8)`, `FLAGS = (POK,IsCOW,pPOK)`, `FLAGS = (POK,pPOK,UTF8)`. Your examples certainly all appear to propagate the flag. However `substr()` appears to propagate it only if the ~~resulting substring~~ source string has characters above 0x7f: I have no idea why that appears to be an exception. And I also do not know of any guarantee that any of these behaviours will be retained in future perl versions (though I think it is hugely unlikely that the steps involved in the code above will change, due to its widespread use in core).
Update: from perl source, it appears `substr` returns a UTF8_off result if the source string has byte length and character length the same.
Re^6: Seeking Perl docs about how UTF8 flag propagates by choroba (Cardinal) on May 17, 2023 at 08:07 UTC
Re^7: Seeking Perl docs about how UTF8 flag propagates by hv (Prior) on May 17, 2023 at 12:13 UTC
Re^7: Seeking Perl docs about how UTF8 flag propagates by LanX (Saint) on May 17, 2023 at 11:21 UTC
Re^6: Seeking Perl docs about how UTF8 flag propagates by raygun (Scribe) on May 16, 2023 at 05:56 UTC
Re^3: Seeking Perl docs about how UTF8 flag propagates by ikegami (Patriarch) on May 17, 2023 at 16:37 UTC
> If that's the intent
It is. Code that behaves differently based on the internal storage format is said to suffer from The Unicode Bug.
> it doesn't always work in practice
True. Notably, the operators that accept file names. And of course, some XS modules. `utf8::upgrade` and `utf8::downgrade` can be used to work around these bugs.
> certain functions (e.g., lc) change their behavior depending on how the flag is set.
`lc`, `uc` and the regex engine were fixed in 5.14, released in 2011 (12 years ago). To get the fix, you need `use v5.14;` or, more specifically, `use feature qw( unicode_strings );`. (The feature actually appeared in 5.12, but it didn't fix as many things in 5.12 as in 5.14, so I pretend it was added in 5.14.)
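On the `utf8::upgrade` / `utf8::downgrade` workaround mentioned above: a small sketch of what those two functions do and do not change. The sample string and variable names are illustrative assumptions; the point is only that the storage format (and hence the flag) changes while the string, as Perl compares it, does not.

```perl
use strict;
use warnings;

my $bytes = "caf\x{e9}";    # all code points below 0x100: one byte per character
my $wide  = $bytes;
utf8::upgrade($wide);       # re-encode the internal buffer as UTF-8, flag on

printf "flags: %d %d\n",
    utf8::is_utf8($bytes) ? 1 : 0, utf8::is_utf8($wide) ? 1 : 0;
printf "equal: %s, lengths: %d %d\n",
    ($bytes eq $wide ? "yes" : "no"), length($bytes), length($wide);

utf8::downgrade($wide);     # back to one byte per character; dies if that is impossible
printf "after downgrade: %d\n", utf8::is_utf8($wide) ? 1 : 0;
```

Code that is sensitive to the storage format (such as the file name operators or an XS module) can then be handed whichever form it expects.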
Re^3: Seeking Perl docs about how UTF8 flag propagates by haj (Vicar) on May 16, 2023 at 22:22 UTC
Could you please provide an example where `lc` behaves differently, depending on the flag? As far as I know `lc` will simply preserve the flag of the input (I am not sure whether this holds on EBCDIC platforms). The opposite function, `uc`, is known to set the flag for a (non-flagged) input of `chr 0xFF` or 'ÿ': its uppercase equivalent 'Ÿ' is not present in ISO-8859-1, but taken from the Unicode block Latin Extended-A.
Re^4: Seeking Perl docs about how UTF8 flag propagates by hv (Prior) on May 17, 2023 at 00:22 UTC
Not sure about `lc()`, but here's another case where the closely-related `uc()` behaves differently: `$ascii = "\x{df}"; chop($utfer = "\x{100}"); $utf = $ascii . $utfer; print uc($_) for ($ascii, $utf);` As a Unicode codepoint, "\x{df}" is interpreted as the lowercase German "es-zed" character (ß), which uppercases to "SS". As an ASCII codepoint it is seen as a non-word character, and does not change. This is a rare case where changing the case of a string also changes its length.
Re^5: Seeking Perl docs about how UTF8 flag propagates by hippo (Archbishop) on May 17, 2023 at 06:46 UTC
> As an ASCII codepoint
Nitpick: it isn't ASCII. I suspect you meant either ISO-8859-1 or Latin-1 or non-Unicode instead of ASCII, which has a highest codepoint of `\x{7f}`. 🦛
Re^6: Seeking Perl docs about how UTF8 flag propagates by haj (Vicar) on May 17, 2023 at 06:51 UTC
Re^5: Seeking Perl docs about how UTF8 flag propagates by haj (Vicar) on May 17, 2023 at 06:34 UTC
Ah, interesting. I missed that because it behaves differently depending on the use of `feature`s: `haj@vdesktop:~$ perl -M5.010 -C -e 'print uc chr 0xdf, "\n"'` prints `ß`, whereas `haj@vdesktop:~$ perl -M5.012 -C -e 'print uc chr 0xdf, "\n"'` prints `SS`.
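The same comparison can be made inside one script by toggling the `unicode_strings` feature lexically. This is a sketch under the assumption of a perl recent enough to have that feature (5.12 or later, ideally 5.14+); the variable names are made up. It compares result lengths rather than printing the characters, so no output layers are needed.

```perl
use strict;
use warnings;

my $bytes = "\x{df}";    # U+00DF, stored with the UTF8 flag off
my $wide  = $bytes;
utf8::upgrade($wide);    # same character, UTF8 flag on

my (@without, @with);
{
    no feature 'unicode_strings';    # legacy rules: result depends on the flag
    @without = map { length uc $_ } $bytes, $wide;
}
{
    use feature 'unicode_strings';   # Unicode rules regardless of the flag
    @with = map { length uc $_ } $bytes, $wide;
}

print "without unicode_strings: @without\n";
print "with unicode_strings:    @with\n";
```

A length of 2 means `uc` produced "SS"; a length of 1 means the character was left alone.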
Re^3: Seeking Perl docs about how UTF8 flag propagates by LanX (Saint) on May 15, 2023 at 21:49 UTC
> when certain functions (e.g., lc) change their behavior depending on how the flag is set
That's the point you seem to be missing. The function `length` must report different numbers of characters, if 2-4 bytes are supposed to represent a unicode entity because of the utf8-flag. Same for other functions. Otherwise please be more specific about what `lc` does wrongly...
Cheers Rolf
(addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
Re^4: Seeking Perl docs about how UTF8 flag propagates by hv (Prior) on May 16, 2023 at 02:19 UTC
The function `length()` always reports the number of characters in the string: you do not need to know whether the UTF8 flag is set on your string to understand what it will do. The function `lc()` on the other hand will give different results for the same string (i.e. a string consisting of the same characters) depending on whether the UTF8 flag is set or not. As such it is an example of the Unicode bug in action. This is not necessarily wrong: it is after all documented behaviour. But it does mean that, despite the intent, the programmer needs to know how the UTF8 flag will have been set to correctly predict the behaviour of `lc()` on strings containing certain characters.
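To make the contrast concrete, here is a small sketch (the choice of U+00C9 and the variable names are illustrative assumptions, and `unicode_strings` is switched off explicitly so the legacy behaviour is what gets shown). Both scalars hold the same one-character string and differ only in the flag.

```perl
use strict;
use warnings;
no feature 'unicode_strings';    # make the legacy behaviour explicit

my $bytes = "\x{c9}";    # U+00C9 (É), UTF8 flag off
my $wide  = $bytes;
utf8::upgrade($wide);    # same character, UTF8 flag on

print "same string: ", ($bytes eq $wide ? "yes" : "no"), "\n";
print "length:      ", length($bytes), " ", length($wide), "\n";

# Show the code point of each lc() result in hex.
printf "lc:          %02X %02X\n", ord lc($bytes), ord lc($wide);
```

`length` and `eq` treat the two scalars identically; `lc` is where the storage format leaks through.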
Re^5: Seeking Perl docs about how UTF8 flag propagates by LanX (Saint) on May 16, 2023 at 09:48 UTC
I'm puzzled. Of course length can also give you different results if you change the UTF8 flag.
Cheers Rolf
(addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
Re^6: Seeking Perl docs about how UTF8 flag propagates by hv (Prior) on May 16, 2023 at 12:47 UTC
Re^7: Seeking Perl docs about how UTF8 flag propagates by haj (Vicar) on May 16, 2023 at 13:35 UTC
Re^7: Seeking Perl docs about how UTF8 flag propagates by LanX (Saint) on May 16, 2023 at 15:10 UTC
Re^6: Seeking Perl docs about how UTF8 flag propagates by haj (Vicar) on May 16, 2023 at 22:08 UTC
Re^7: Seeking Perl docs about how UTF8 flag propagates by LanX (Saint) on May 17, 2023 at 01:06 UTC
Re^4: Seeking Perl docs about how UTF8 flag propagates by raygun (Scribe) on May 16, 2023 at 01:51 UTC
> Otherwise please be more specific about what lc does wrongly...
As far as I know it does nothing wrongly. It just does things differently depending on previous operations whose effects are not fully documented.
Re^5: Seeking Perl docs about how UTF8 flag propagates by LanX (Saint) on May 16, 2023 at 09:56 UTC
To answer your main question, I would be very surprised if there were normal cases where the UTF8 flag isn't preserved when passing strings around. So no need to document the obvious. Without the UTF8 flag it's an octet stream, and all string commands will treat every single byte (for backward compatibility) as (some) ASCII character. Otherwise it's a "character string" and Perl will use the internal UTF8 format to map one or more bytes to characters. Easy.
Cheers Rolf
(addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
Re^6: Seeking Perl docs about how UTF8 flag propagates by haj (Vicar) on May 16, 2023 at 11:20 UTC
Re^7: Seeking Perl docs about how UTF8 flag propagates by LanX (Saint) on May 22, 2023 at 14:18 UTC
Re^7: Seeking Perl docs about how UTF8 flag propagates by LanX (Saint) on May 16, 2023 at 11:34 UTC
Re^6: Seeking Perl docs about how UTF8 flag propagates by raygun (Scribe) on May 16, 2023 at 21:40 UTC