Re^5: Seeking Perl docs about how UTF8 flag propagates

Re^6: Seeking Perl docs about how UTF8 flag propagates by hv (Prior) on May 16, 2023 at 12:47 UTC
I believe you are thinking in terms of particular encodings, which knowledge of the Unicode bug wrongly tempts one to do. I'm not talking about taking an internal representation as a sequence of bytes and then flipping the UTF8 flag on that internal representation, I'm talking about the actual strings represented by the internal representation.

A string in Perl is a sequence of characters, not the sequence of bytes (or octets) that represents those characters in a particular encoding. If `length()` gives a different answer on two strings, then they are not the same sequence of characters.

In a Unicode world there is one string `"fu\x{df}"` consisting of three characters. Internally Perl might encode that in one of two different ways, resulting in different byte sequences and a different setting of the UTF8 flag, but it is the same string whichever encoding is used.

So in the code below, I would expect `verify_upgraded_length` and `verify_downgraded_length` to return a TRUE value for every string input (if they return at all).

```perl
use utf8 ();

sub verify_upgraded_length {
    my($s) = @_;
    # wrong
    # my $u = utf8::upgrade($s);
    my $u = $s;
    utf8::upgrade($u);
    return length($s) == length($u);
}

sub verify_downgraded_length {
    my($s) = @_;
    # wrong
    # my $d = utf8::downgrade($s);  # dies if downgrade not possible
    my $d = $s;
    utf8::downgrade($d);  # dies if downgrade not possible
    return length($s) == length($d);
}
```

[Updated: corrected code, thanks haj++]
Re^7: Seeking Perl docs about how UTF8 flag propagates by haj (Vicar) on May 16, 2023 at 13:35 UTC
Nitpicking (again): `utf8::upgrade` and `utf8::downgrade` do not return converted strings. They do the change in place, and return a success indicator instead. One solution is to take a copy (also note that for current Perls, `use utf8` isn't needed):

```perl
sub verify_upgraded_length {
    my($s) = @_;
    my $u = $s;
    utf8::upgrade($u);
    return length($s) == length($u);
}

sub verify_downgraded_length {
    my($s) = @_;
    my $d = $s;
    utf8::downgrade($d);  # dies if downgrade not possible
    return length($s) == length($d);
}
```
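A minimal sketch of the in-place behaviour, assuming a plain `perl` with no source-encoding pragmas, so that the literal starts out without the UTF8 flag. The variable names are only illustrative. As far as I read `perldoc utf8`, the value returned by `utf8::upgrade` is the number of octets of the UTF-8 form rather than the converted string, which is why assigning it to the "converted" variable goes wrong.

```perl
use strict;
use warnings;

my $s = "fu\x{df}";             # three characters, stored without the UTF8 flag
my $u = $s;                     # copy, so that $s keeps its original representation

my $ret = utf8::upgrade($u);    # changes $u in place; $ret is NOT the upgraded string

printf "flag on original: %s\n", utf8::is_utf8($s) ? "on" : "off";
printf "flag on copy:     %s\n", utf8::is_utf8($u) ? "on" : "off";
printf "same string:      %s\n", $s eq $u ? "yes" : "no";
printf "lengths:          %d and %d\n", length($s), length($u);
printf "return value:     %s\n", $ret;
```

The copy carries the flag and the original does not, yet `eq` and `length` cannot tell them apart.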
Re^7: Seeking Perl docs about how UTF8 flag propagates by LanX (Saint) on May 16, 2023 at 15:10 UTC
> A string in Perl is a sequence of characters, not the sequence of bytes (or octets) that represents those characters in a particular encoding

Unfortunately that's not the terminology of Perldocs!

Text strings, or character strings are made of characters. Bytes are irrelevant here

Binary strings, or byte strings are made of bytes. Here, you don't have characters, just bytes

The latter (AKA Octet-Streams in other docs) lets `lc` work on "NON characters" b/c of backwards compatibility to Perl4.

And the defining difference between both types of strings is the UTF8-Flag.

Cheers Rolf
(addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
Wikisyntax for the Monastery
Re^8: Seeking Perl docs about how UTF8 flag propagates by hv (Prior) on May 16, 2023 at 15:35 UTC
That text is somewhat misleading - and note that it does not mention the UTF8 flag, indeed it explicitly says you "shouldn't worry" about what the internal format is.

Perl does not distinguish between text strings and binary strings, but programmers may do so when deciding how to interpret the contents of a string. Thus they may interpret input received from an outside source as a sequence of utf8 octets, and decide that they need to decode it to get the desired sequence of characters.

The string "\x{c3}\x{9f}" is a sequence of two characters; if the programmer interprets it as a sequence of utf8 octets, they might choose to decode it to get the string "\x{df}". Those are two different strings.

However the string "\x{c3}\x{9f}" has two different possible internal representations, one with and one without the UTF8 flag enabled. It is the same string - the same sequence of characters - regardless of the internal representation. The same is true of the two different possible internal representations of "\x{df}".

Any time the abstraction leaks out - any time you need to care about which internal representation is being used for the string - that's an example of the Unicode bug.
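To make the distinction concrete, here is a small sketch (variable names are mine, not from any documentation) that contrasts decoding, which yields a different string, with upgrading, which yields the same string in a different internal representation. It assumes the literal is stored unflagged, as is the case without source-encoding pragmas.

```perl
use strict;
use warnings;

my $octets = "\x{c3}\x{9f}";    # two characters: U+00C3 and U+009F

my $decoded = $octets;
utf8::decode($decoded);         # interpret the two values as UTF-8 octets

my $upgraded = $octets;
utf8::upgrade($upgraded);       # same characters, other internal representation

printf "decoded  eq octets: %s (length %d)\n",
    ($decoded eq $octets ? "yes" : "no"), length($decoded);
printf "upgraded eq octets: %s (length %d)\n",
    ($upgraded eq $octets ? "yes" : "no"), length($upgraded);
printf "decoded  eq \"\\x{df}\": %s\n", ($decoded eq "\x{df}" ? "yes" : "no");
```

Only the decode step changes what `eq` and `length` report; the upgrade is invisible at the string level.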
Re^8: Seeking Perl docs about how UTF8 flag propagates by haj (Vicar) on May 16, 2023 at 18:18 UTC
These days I'm working with binary data in Perl, so I'm a bit skeptical about such claims. Consider: `my $upper_a = "A";`

In `$upper_a`, the UTF8 flag is off. All character functions of Perl work on this variable, and that's not only for backwards compatibility. I'd rather call it a text string than a binary string. If I create a character with `$upper_a = "\N{U+41}"`, then the result has the UTF8 flag on. It behaves in no way different from `$upper_a` created before.

Binary strings in Perl need careful treatment. Strings where the UTF8 flag is off qualify as binary strings - but these can also be used as character strings. Applying character functions to a binary string might switch the UTF8 flag on. The flag is contagious: one term where the flag is on (even when it is on a character from the ASCII set) is sufficient to have the result carrying the flag. Whether that's a problem depends on whether each position (to avoid the terms "byte" and "character" for the moment) holds a value of less than 256.

If I create an `'ä'` as `\N{U+E4}`, then the UTF8 flag is on. Perl's internal `PV` is `"\303\244"` (`0xC3A4`). But still, it can be treated like a binary `0xE4`: It matches `/\xE4/`, and it can be printed without a "wide character" warning as a single byte.
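The contagion can be observed with a short sketch like the following. To avoid depending on how a particular literal is stored, it forces the flag on with `utf8::upgrade`; the variable names are only illustrative.

```perl
use strict;
use warnings;

my $bytes = "\xE4";                 # one position, value 0xE4, flag off

my $ascii = "A";
utf8::upgrade($ascii);              # ASCII only, but now carrying the flag

my $joined = $bytes . $ascii;       # the flagged operand makes the result flagged

printf "flag on \$bytes:  %s\n", utf8::is_utf8($bytes)  ? "on" : "off";
printf "flag on \$joined: %s\n", utf8::is_utf8($joined) ? "on" : "off";
printf "length:          %d\n", length($joined);
printf "matches /^\\xE4A\$/: %s\n", ($joined =~ /^\xE4A$/ ? "yes" : "no");

# Every position is below 256, so the result can be downgraded again
# without loss and used as binary data.
my $ok = utf8::downgrade($joined, 1);   # second argument: return false instead of dying
printf "downgrade ok:    %s\n", $ok ? "yes" : "no";
```

The flag travels with the result, but as long as every position stays below 256 the string can still be treated as binary data.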
Re^9: Seeking Perl docs about how UTF8 flag propagates (Terminology) by LanX (Saint) on May 17, 2023 at 17:15 UTC
Re^10: Seeking Perl docs about how UTF8 flag propagates (Terminology) by ikegami (Patriarch) on May 17, 2023 at 18:41 UTC
Re^6: Seeking Perl docs about how UTF8 flag propagates by haj (Vicar) on May 16, 2023 at 22:08 UTC
> Of course length can also give you different results if you change the UTF8 flag.

Could you please give an example for this? How do you "change the UTF8 flag"?

`utf8::encode` and `utf8::decode` change the flag, but also the characters (same for Encode). So, obviously, the length can change.

`utf8::upgrade` and `utf8::downgrade` change the flag and the internal representation, but keep the length.
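A rough side-by-side of the two families, as a sketch with illustrative variable names and a literal that is assumed to start out unflagged:

```perl
use strict;
use warnings;

my $s = "\x{e4}";                       # one character, flag off

my $up = $s;
utf8::upgrade($up);                     # flag on, same character

my $enc = $s;
utf8::encode($enc);                     # now the UTF-8 octets of that character

my $round = $enc;
utf8::decode($round);                   # back to the single character

for my $case ( [ original => $s ], [ upgraded => $up ],
               [ encoded  => $enc ], [ decoded => $round ] ) {
    my ($name, $str) = @$case;
    printf "%-8s length=%d flag=%s eq-original=%s\n",
        $name, length($str),
        (utf8::is_utf8($str) ? "on" : "off"),
        ($str eq $s ? "yes" : "no");
}
```

Upgrading leaves `length` and `eq` untouched, while encoding produces a different, longer string that only decoding turns back into the original.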
Re^7: Seeking Perl docs about how UTF8 flag propagates by LanX (Saint) on May 17, 2023 at 01:06 UTC
> Could you please give an example for this? How do you "change the UTF8 flag"?

```perl
use v5.12.0;
use warnings;
#use Devel::Peek;
use utf8;
use Encode qw(is_utf8 _utf8_on _utf8_off);

my $str = "ä";
say $str, ":",length($str);
#Dump($str);
_utf8_off($str);
say $str, ":",length($str);
#Dump($str);
```

```
ä:1
Ã¤:2
```

Cheers Rolf
(addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
Wikisyntax for the Monastery
Re^8: Seeking Perl docs about how UTF8 flag propagates by hv (Prior) on May 17, 2023 at 03:51 UTC
Where the documentation for those functions says "INTERNAL", that should be taken as shorthand for "GO AWAY. THIS IS A REALLY ******* BAD IDEA. PUT DOWN THE UTF8 FLAG AND BACK AWAY." It is really quite depressing that this is expressed in shorthand.

It is not a good idea to use these functions. It is not a good idea to suggest anyone else uses these functions. These functions should almost certainly not exist: there are vanishingly few people that are competent to use them safely, and to the best of my knowledge those people would in all cases know (and prefer) other ways to achieve the same effects.

The functions in the utf8 module (upgrade, downgrade, encode, decode) are vastly safer, for example.
Re^9: Seeking Perl docs about how UTF8 flag propagates by Anonymous Monk on May 17, 2023 at 05:43 UTC
Re^8: Seeking Perl docs about how UTF8 flag propagates by haj (Vicar) on May 17, 2023 at 06:46 UTC
Thanks! `Encode::_utf8_off` is indeed a way to break things (`use bytes;` or XS code being other possibilities). All of them come with the appropriate warning signs. So, it's nothing to worry about.
Re^9: Seeking Perl docs about how UTF8 flag propagates by LanX (Saint) on May 17, 2023 at 10:21 UTC
Re^10: Seeking Perl docs about how UTF8 flag propagates by haj (Vicar) on May 17, 2023 at 11:27 UTC