Re^5: Seeking Perl docs about how UTF8 flag propagates

Re^6: Seeking Perl docs about how UTF8 flag propagates by haj (Vicar) on May 16, 2023 at 11:20 UTC
> Without UTF8-Flag it's a octet-stream and all string commands will treat every single byte (for backward compatibility) as (some) ASCII character

Nitpick: Single bytes also work in the 128-255 range, so it is ISO-8859-1 rather than ASCII. For example, an `ä` encoded as `chr 0xE4` matches `qr/\w/`, according to its Unicode property.
Re^7: Seeking Perl docs about how UTF8 flag propagates by LanX (Saint) on May 22, 2023 at 14:18 UTC
> > Without UTF8-Flag it's a octet-stream and all string commands will treat every single byte (for backward compatibility) as (some) ASCII character
>
> Nitpick: Single bytes also work in the 128-255 range, so it is rather ISO-8859-1 than ASCII. For example, an (ä) encoded as chr 0xE4 matches qr/\w/, according to its unicode property.

I can't reproduce this. From what I see, `\w` defaults to pure ASCII:

```
C:\Users\rolflangsdorf>perl
$a=chr(0xE4);
print "$a matched \\w" if $a =~/^\w/;
__END__

C:\Users\rolflangsdorf>perl -v

This is perl 5, version 32, subversion 1 (v5.32.1) built for MSWin32-x64-multi-thread
```

Cheers Rolf
(addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
Wikisyntax for the Monastery
Re^8: Seeking Perl docs about how UTF8 flag propagates by haj (Vicar) on May 22, 2023 at 15:45 UTC
As has been said in this thread (and elsewhere): to reproduce, `use 5.012;` or newer, or more explicitly: `use feature 'unicode_strings';`

I have the habit of always specifying a minimum version in my programs, including demos for PerlMonks. I admit that it didn't occur to me that without a version declaration (or with a declaration of 5.010 or older) Perl behaves differently.
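The difference between the two results in this sub-thread can be reproduced in a single script. The following is a minimal sketch, not taken from any of the posts: the variable names are made up for illustration, and it assumes a Perl recent enough that regex matching honours `unicode_strings` (as far as I know, 5.14 or later) with no `use locale` in effect. It matches the same one-byte string against `\w` three times: with the feature off, with the feature on, and with the feature off but the UTF8 flag switched on via `utf8::upgrade`.

```perl
use strict;
use warnings;

# chr(0xE4) is "ä" in ISO-8859-1; Perl stores it as one byte, UTF8 flag off.
my $char = chr(0xE4);

# Feature explicitly off: the rules you get without a version declaration.
my $legacy = do {
    no feature 'unicode_strings';
    $char =~ /^\w/ ? 1 : 0;
};

# Same string, same pattern, but compiled with the feature on.
my $unicode = do {
    use feature 'unicode_strings';
    $char =~ /^\w/ ? 1 : 0;
};

# Feature off again, but the copy now carries the UTF8 flag.
my $upgraded = do {
    no feature 'unicode_strings';
    my $copy = $char;
    utf8::upgrade($copy);    # same character, different internal encoding
    $copy =~ /^\w/ ? 1 : 0;
};

print "feature off, flag off: $legacy\n";
print "feature on,  flag off: $unicode\n";
print "feature off, flag on:  $upgraded\n";
```

Under those assumptions, the first case should not match while the other two should. That the third result differs from the first, although both strings compare equal, is the reason the UTF8 flag matters to code that runs without `unicode_strings`.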
Re^9: Seeking Perl docs about how UTF8 flag propagates (update: unicode_rules ) by LanX (Saint) on May 22, 2023 at 16:13 UTC
Re^7: Seeking Perl docs about how UTF8 flag propagates by LanX (Saint) on May 16, 2023 at 11:34 UTC
Yeah, I was too lazy to look it up, so I said "some ASCII" (which is probably still technically incorrect).

Edit: doesn't it depend on the locale settings?

Cheers Rolf
(addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
Wikisyntax for the Monastery
Re^6: Seeking Perl docs about how UTF8 flag propagates by raygun (Scribe) on May 16, 2023 at 21:40 UTC
> I would be very surprised if there where normal cases where the UTF8-Flag isn't preserved when passing around. So no need to document the obvious.

It sounds like you've confused "expected" with "obvious."

"`Widget` can contain any ASCII character. This includes the semicolon." That second sentence is obvious: it's an easily deducible consequence of the first, so it need not (and should not) be stated.

Conversely, you can expect things to happen a certain way, but software can sometimes defy your expectation, as hv's remark below about `substr` demonstrates. That doesn't mean the software is misbehaving; it just means your expectation didn't align with it. In contrast, if an ASCII-holding object fails to be able to contain a semicolon, that's beyond unexpected; that's a bug. And frankly, I don't think there is a clear expectation about whether things like `substr` and `split`, which create brand new strings out of pieces of existing strings, should blindly copy the UTF8 flag of their input.

Ideally, "expected" behavior would still be documented. (That statement strays into tautologyland: documenting things is how users know to expect them.) If the behavior is expected but still intentionally undefined, that fact ought to be documented too, so that coders know not to rely on it. The case at hand is neither, which to me suggests a shortfall in the documentation. A sentence or two in perlunicode about what's not guaranteed regarding the UTF8 flag would solve this, and let coders ensure their code isn't making unwarranted assumptions.
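Until the documentation says something on this point, the flag on a derived string can at least be inspected. The sketch below is only an illustration: the sample string and the `flag_state` helper are hypothetical, and whatever it prints describes the Perl it runs on, not a documented guarantee. It uses `utf8::is_utf8` to report the flag on the results of `substr` and `split`.

```perl
use strict;
use warnings;

# Hypothetical helper: report the internal UTF8 flag of a string.
sub flag_state { utf8::is_utf8($_[0]) ? 'on' : 'off' }

# The code point above 0xFF forces Perl to store this string with the flag on.
my $flagged = "abc;\x{263A}";

# Both pieces taken below contain only ASCII characters.
my $head  = substr($flagged, 0, 3);
my @parts = split /;/, $flagged;

printf "source string:  flag %s\n", flag_state($flagged);
printf "substr result:  flag %s\n", flag_state($head);
printf "split result 0: flag %s\n", flag_state($parts[0]);

# Whatever the flags turn out to be, the characters are the same.
print "contents equal\n" if $head eq 'abc' and $parts[0] eq 'abc';
```

The final comparison is the part code can rely on: string equality depends on the characters, not on the flag. Code that branches on `utf8::is_utf8` for a derived string is making exactly the kind of assumption discussed above.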
