Re^5: Seeking Perl docs about how UTF8 flag propagates

I'd need to (laboriously) check the source for chapter and verse, but as far as I remember in all the obvious cases when any of the inputs have UTF8 on, the output will too.

Here's an example commonly used in perl's tests to create a UTF8-flagged string by appending a flagged zero-length string:

% perl -MDevel::Peek -wle '
  $x="\x{100}"; Dump($x);
  chop $x;      Dump($x);
  $y = "foo";   Dump($y);
  $y .= $x;     Dump($y)
' 2>&1 | grep FLAGS
  FLAGS = (POK,IsCOW,pPOK,UTF8)
  FLAGS = (POK,pPOK,UTF8)
  FLAGS = (POK,IsCOW,pPOK)
  FLAGS = (POK,pPOK,UTF8)
%
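The same kind of check can be scripted without Devel::Peek or grep. The following is only a sketch under stated assumptions: the helper names `flagged_ascii` and `propagation_report` are invented for illustration, and the four operations shown are a sample of "obvious cases", not an exhaustive or guaranteed list.

```perl
use strict;
use warnings;

# Hypothetical helper: a flagged string that contains only ASCII characters.
sub flagged_ascii {
    my ($text) = @_;
    utf8::upgrade($text);   # sets the UTF8 flag without changing the characters
    return $text;
}

# Hypothetical helper: run a few common operations with one flagged input
# and record whether each result carries the UTF8 flag.
sub propagation_report {
    my $plain   = "foo";
    my $flagged = flagged_ascii("bar");
    my %result  = (
        concat      => $plain . $flagged,
        interpolate => "$plain-$flagged",
        join        => join("-", $plain, $flagged),
        sprintf     => sprintf("%s-%s", $plain, $flagged),
    );
    return { map { $_ => (utf8::is_utf8($result{$_}) ? 1 : 0) } keys %result };
}

my $report = propagation_report();
printf "%-12s %s\n", $_, $report->{$_} ? "UTF8" : "no UTF8"
    for sort keys %$report;
```

Using an ASCII-only flagged input keeps the test honest: the result has no character that would force the flag on by itself, so any flag seen on the output was propagated from the input.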

Your examples certainly all appear to propagate the flag. However, substr() appears to propagate it only if the ~~resulting substring~~ source string has characters above 0x7f; I have no idea why that appears to be an exception. I also do not know of any guarantee that any of these behaviours will be retained in future perl versions (though I think it is hugely unlikely that the steps involved in the code above will change, given their widespread use in core).

Update: from perl source, it appears substr returns a UTF8_off result if the source string has byte length and character length the same.
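To see that rule in action, a small probe can print the character length, the byte length, and the flag on a one-character substr() result side by side. This is a sketch only: `probe_substr` is a hypothetical helper, it relies on the `bytes` pragma purely to read the internal byte length, and the behaviour it reports is whatever the running perl does, not something documented as guaranteed.

```perl
use strict;
use warnings;

# Hypothetical helper: reports the lengths of the source string and whether
# a one-character substr() of it carries the UTF8 flag.
sub probe_substr {
    my ($src) = @_;
    my $chars = length $src;
    my $bytes = do { use bytes; length $src };   # internal byte length
    my $piece = substr $src, 0, 1;
    return {
        src_flagged   => utf8::is_utf8($src)   ? 1 : 0,
        chars         => $chars,
        bytes         => $bytes,
        piece_flagged => utf8::is_utf8($piece) ? 1 : 0,
    };
}

my $wide   = "foo\x{100}";   # flagged; byte length exceeds character length
my $narrow = $wide;
chop $narrow;                # still flagged; both lengths are now equal

for my $case ([wide => $wide], [narrow => $narrow]) {
    my ($name, $str) = @$case;
    my $r = probe_substr($str);
    printf "%-6s src_flagged=%d chars=%d bytes=%d piece_flagged=%d\n",
        $name, @{$r}{qw(src_flagged chars bytes piece_flagged)};
}
```

If the reading of the source above is right, the `narrow` line should show a flagged source with an unflagged piece, while the `wide` line shows both flagged.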

Re^6: Seeking Perl docs about how UTF8 flag propagates by choroba (Cardinal) on May 17, 2023 at 08:07 UTC
> However substr() appears to propagate it only if the resulting substring has characters above 0x7f

What do you mean by "appears"? I tried the following:

```perl
#!/usr/bin/perl
use warnings;
use strict;

use Devel::Peek;

my $s = "\N{LATIN SMALL LETTER S WITH CARON}i\N{LATIN SMALL LETTER C WITH CARON}";
for my $i (0 .. 2) {
    my $c = substr $s, $i, 1;
    Dump($c);
}
```

Running it through `2>&1 | grep FLAGS` outputs

```
FLAGS = (POK,pPOK,UTF8)
FLAGS = (POK,pPOK,UTF8)
FLAGS = (POK,pPOK,UTF8)
```

Update: Fixed the encoding of the code.

`map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]`

Re^7: Seeking Perl docs about how UTF8 flag propagates by hv (Prior) on May 17, 2023 at 12:13 UTC
I looked at a small number of variations around the second case here:

```perl
$x = "foo\x{100}";
chop($y = $x);
$z = substr($x, 0, 1); Dump($z);   # UTF8
$a = substr($y, 0, 1); Dump($a);   # not UTF8
```

Looking at the perl source, it looks like it treats it differently (and ends up not flagging as UTF8) if the byte length and the character length of the whole source string are the same. (Which is a potential efficiency concern: finding the character length of a large UTF8-flagged string is expensive.)

Re^7: Seeking Perl docs about how UTF8 flag propagates by LanX (Saint) on May 17, 2023 at 11:21 UTC
To get rid of the OS-grep dependency, try `utf8::is_utf8`:

```perl
use v5.12.0;
use warnings;
#use utf8;
binmode STDOUT, ":utf8";

print my $s = "\N{LATIN SMALL LETTER S WITH CARON}i\N{LATIN SMALL LETTER C WITH CARON}";
say " is text" if utf8::is_utf8($s);

for my $i (0 .. 2) {
    print my $c = substr $s, $i, 1;
    say " is text" if utf8::is_utf8($c);
}
```

```
šič is text
š is text
i is text
č is text
```

Please note that even after commenting `use utf8` out, `$s` is still automatically flagged as text.

The pragma is optional here (not recommended), because

- the functions are universally available in Perl
- there is no wide-character in the source-code

Cheers Rolf

(addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
Wikisyntax for the Monastery

Re^6: Seeking Perl docs about how UTF8 flag propagates by raygun (Scribe) on May 16, 2023 at 05:56 UTC
> I'd need to (laboriously) check the source for chapter and verse,

Please don't go to the effort for my sake. Do it if you're curious, or if a laborious code audit is your idea of a good time.

> but as far as I remember in all the obvious cases when any of the inputs have UTF8 on, the output will too.

Thank you! That's very useful, and probably good enough for my current project (though I wouldn't rely on it for enterprise-level code).