Re^3: Seeking Perl docs about how UTF8 flag propagates

Re^4: Seeking Perl docs about how UTF8 flag propagates by hv (Prior) on May 16, 2023 at 02:19 UTC
The function `length()` always reports the number of characters in the string: you do not need to know whether the UTF8 flag is set on your string to understand what it will do. The function `lc()`, on the other hand, will give different results for the same string (i.e. a string consisting of the same characters) depending on whether the UTF8 flag is set or not. As such it is an example of the Unicode bug in action. This is not necessarily wrong: it is, after all, documented behaviour. But it does mean that, despite the intent, the programmer needs to know how the UTF8 flag will have been set to correctly predict the behaviour of `lc()` on strings containing certain characters.
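The contrast between `length()` and `lc()` described above can be sketched as follows. This is a minimal illustration, not taken from the thread: the helper `upgraded_copy` and the choice of `"\xC0"` as the test character are assumptions, and `no feature 'unicode_strings'` is spelled out only to make the legacy rules explicit.

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # legacy rules: this is where the Unicode bug shows

# Hypothetical helper: return a copy of $s whose internal storage is UTF-8
# (flag on). The characters in the string are unchanged.
sub upgraded_copy {
    my ($s) = @_;
    utf8::upgrade($s);
    return $s;
}

my $plain    = "\xC0";                  # LATIN CAPITAL LETTER A WITH GRAVE, flag off
my $upgraded = upgraded_copy($plain);   # same character, flag on

# Same characters, same length, whatever the flag says.
printf "equal strings:      %s\n", $plain eq $upgraded ? "yes" : "no";
printf "length:             %d vs %d\n", length($plain), length($upgraded);

# lc() is where the flag leaks through: only the upgraded copy is lowercased.
printf "lc() lowercased it: %s vs %s\n",
    (lc($plain)    eq "\xE0" ? "yes" : "no"),
    (lc($upgraded) eq "\xE0" ? "yes" : "no");
```

With `use feature 'unicode_strings'` in effect instead, both calls to `lc()` should agree, which is the documented way to avoid depending on the flag.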
Re^5: Seeking Perl docs about how UTF8 flag propagates by LanX (Saint) on May 16, 2023 at 09:48 UTC
I'm puzzled. Of course length can also give you different results if you change the UTF8 flag. Cheers Rolf (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
Re^6: Seeking Perl docs about how UTF8 flag propagates by hv (Prior) on May 16, 2023 at 12:47 UTC
I believe you are thinking in terms of particular encodings, which knowledge of the Unicode bug wrongly tempts one to do. I'm not talking about taking an internal representation as a sequence of bytes and then flipping the UTF8 flag on that internal representation; I'm talking about the actual strings represented by the internal representation.

A string in Perl is a sequence of characters, not the sequence of bytes (or octets) that represents those characters in a particular encoding. If `length()` gives a different answer on two strings, then they are not the same sequence of characters.

In a Unicode world there is one string `"fu\x{df}"` consisting of three characters. Internally Perl might encode that in one of two different ways, resulting in different byte sequences and a different setting of the UTF8 flag, but it is the same string whichever encoding is used.

So in the code below, I would expect `verify_upgraded_length` and `verify_downgraded_length` to return a TRUE value for every string input (if they return at all).

```perl
use utf8 ();

sub verify_upgraded_length {
    my($s) = @_;
    # wrong: my $u = utf8::upgrade($s);
    my $u = $s;
    utf8::upgrade($u);
    return length($s) == length($u);
}

sub verify_downgraded_length {
    my($s) = @_;
    # wrong: my $d = utf8::downgrade($s);  # dies if downgrade not possible
    my $d = $s;
    utf8::downgrade($d);  # dies if downgrade not possible
    return length($s) == length($d);
}
```

Updated: corrected code, thanks haj++
Re^7: Seeking Perl docs about how UTF8 flag propagates by haj (Vicar) on May 16, 2023 at 13:35 UTC
Re^7: Seeking Perl docs about how UTF8 flag propagates by LanX (Saint) on May 16, 2023 at 15:10 UTC
Re^8: Seeking Perl docs about how UTF8 flag propagates by hv (Prior) on May 16, 2023 at 15:35 UTC
Re^8: Seeking Perl docs about how UTF8 flag propagates by haj (Vicar) on May 16, 2023 at 18:18 UTC
Re^6: Seeking Perl docs about how UTF8 flag propagates by haj (Vicar) on May 16, 2023 at 22:08 UTC
"Of course length can also give you different results if you change the UTF8 flag."

Could you please give an example for this? How do you "change the UTF8 flag"? `utf8::encode` and `utf8::decode` change the flag, but also the characters (same for Encode). So, obviously, the length can change. `utf8::upgrade` and `utf8::downgrade` change the flag and the internal representation, but keep the length.
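The distinction drawn above between the two pairs of functions can be sketched like this. It is a minimal illustration assuming the sample string `"fu\x{df}"` used elsewhere in the thread; the `describe` helper is hypothetical and exists only for printing.

```perl
use strict;
use warnings;

# Hypothetical helper: summarise a string's character count and UTF8 flag.
sub describe {
    my ($label, $str) = @_;
    return sprintf "%-9s length=%d flag=%s",
        $label, length($str), utf8::is_utf8($str) ? "on" : "off";
}

my $s = "fu\x{df}";      # three characters, stored one byte each, flag off

my $up = $s;
utf8::upgrade($up);      # internal representation changes, characters do not

my $enc = $s;
utf8::encode($enc);      # characters are replaced by their UTF-8 octets

my $dec = $enc;
utf8::decode($dec);      # octets are turned back into the original characters

print describe(original => $s),   "\n";
print describe(upgraded => $up),  "\n";
print describe(encoded  => $enc), "\n";
print describe(decoded  => $dec), "\n";
```

Only the `encoded` copy reports a different length, because it is a different string (four octets rather than three characters); the upgraded copy differs from the original only in its flag.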
Re^7: Seeking Perl docs about how UTF8 flag propagates by LanX (Saint) on May 17, 2023 at 01:06 UTC
Re^8: Seeking Perl docs about how UTF8 flag propagates by hv (Prior) on May 17, 2023 at 03:51 UTC
Re^8: Seeking Perl docs about how UTF8 flag propagates by haj (Vicar) on May 17, 2023 at 06:46 UTC
Re^4: Seeking Perl docs about how UTF8 flag propagates by raygun (Scribe) on May 16, 2023 at 01:51 UTC
"Otherwise please be more specific about what lc does wrongly..."

As far as I know it does nothing wrongly. It just does things differently depending on previous operations whose effects are not fully documented.
Re^5: Seeking Perl docs about how UTF8 flag propagates by LanX (Saint) on May 16, 2023 at 09:56 UTC
To answer your main question, I would be very surprised if there were normal cases where the UTF8-Flag isn't preserved when passing strings around. So no need to document the obvious. Without the UTF8-Flag it's an octet-stream, and all string commands will treat every single byte (for backward compatibility) as (some) ASCII character. Otherwise it's a "character-string", and Perl will use the internal UTF8 format to map one or more bytes to characters. Easy. Cheers Rolf (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
Re^6: Seeking Perl docs about how UTF8 flag propagates by haj (Vicar) on May 16, 2023 at 11:20 UTC
"Without UTF8-Flag it's a octet-stream and all string commands will treat every single byte (for backward compatibility) as (some) ASCII character"

Nitpick: Single bytes also work in the 128-255 range, so it is ISO-8859-1 rather than ASCII. For example, an `ä` encoded as `chr 0xE4` matches `qr/\w/`, according to its Unicode property.
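A small sketch of the `chr 0xE4` case follows. It rests on one assumption worth stating loudly: the match against `\w` relies on Unicode rules being in effect. As perlre documents it, a non-upgraded string matched under the legacy default rules sees `\w` as ASCII-only, so the example requests Unicode rules explicitly with `/u` and shows the other two cases for comparison. The variable names are illustrative only.

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # make the legacy default rules explicit

my $ae = chr 0xE4;              # "ä" as a single byte, UTF8 flag off

# Unicode rules requested explicitly: the character is a word character.
my $unicode_match = $ae =~ /\w/u ? 1 : 0;

# Legacy default rules on the same non-upgraded string.
my $legacy_match = $ae =~ /\w/ ? 1 : 0;

# Same character, but stored internally as UTF-8 (flag on), default rules.
my $upgraded = $ae;
utf8::upgrade($upgraded);
my $upgraded_match = $upgraded =~ /\w/ ? 1 : 0;

print "with /u:            $unicode_match\n";
print "default, flag off:  $legacy_match\n";
print "default, flag on:   $upgraded_match\n";
```

Enabling `use feature 'unicode_strings'` (or `use v5.12` and later) should make the plain `/\w/` behave like the `/u` case regardless of the flag.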
Re^7: Seeking Perl docs about how UTF8 flag propagates by LanX (Saint) on May 22, 2023 at 14:18 UTC
Re^8: Seeking Perl docs about how UTF8 flag propagates by haj (Vicar) on May 22, 2023 at 15:45 UTC
Re^7: Seeking Perl docs about how UTF8 flag propagates by LanX (Saint) on May 16, 2023 at 11:34 UTC
Re^6: Seeking Perl docs about how UTF8 flag propagates by raygun (Scribe) on May 16, 2023 at 21:40 UTC
"I would be very surprised if there were normal cases where the UTF8-Flag isn't preserved when passing around. So no need to document the obvious."

It sounds like you've confused "expected" with "obvious."

"`Widget` can contain any ASCII character. This includes the semicolon." That second sentence is obvious (it's an easily deducible consequence of the first), so it need not (and should not) be stated.

Conversely, you can expect things to happen a certain way, but software can sometimes defy your expectation, as hv's remark below about `substr` demonstrates. That doesn't mean the software is misbehaving; it just means your expectation didn't align with it. In contrast, if an ASCII-holding object fails to be able to contain a semicolon, that's beyond unexpected: that's a bug. And frankly, I don't think there is a clear expectation about whether things like `substr` and `split`, which create brand new strings out of pieces of existing strings, should blindly copy the UTF8 flag of their input.

Ideally, "expected" behavior would still be documented. (That statement strays into tautologyland: documenting things is how users know to expect them.) If the behavior is expected but still intentionally undefined, that fact ought to be documented too, so that coders know not to rely on it. The case at hand is neither, which to me suggests a shortfall in the documentation. A sentence or two in perlunicode about what's not guaranteed regarding the UTF8 flag would solve this, and let coders ensure their code isn't making unwarranted assumptions.