
certain functions (e.g., lc) change their behavior depending on how the flag is set.

Yes, this is indeed the fly in the ointment. As far as I know, such cases are documented, and in this case at least the documentation describes mechanisms that force lc to behave one way or the other regardless of the UTF8 flag (e.g. use bytes versus use feature 'unicode_strings').

Those aspects of Perl that require you to know the state of the UTF8 flag are collectively known as "the Unicode bug", and there is more detail in a section devoted to this in perlunicode.
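
To make the lc case concrete, here is a minimal sketch, assuming no locale pragma is in effect. The variable names are my own, and the explicit no feature 'unicode_strings' is only there so the first half behaves the same way even if the script is started with a feature bundle enabled.

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # make sure we start with the legacy behaviour

# Two strings holding the same single character, U+00C9.
my $bytes = "\xC9";             # stored as one byte, UTF8 flag off
my $chars = "\xC9";
utf8::upgrade($chars);          # same character, UTF8 flag on

my $same_string = $bytes eq $chars ? 1 : 0;

# Legacy behaviour: lc looks at the flag to decide which rules to use.
my $lc_flag_off = lc($bytes) eq "\xE9" ? 1 : 0;
my $lc_flag_on  = lc($chars) eq "\xE9" ? 1 : 0;

# With unicode_strings in scope, the flag should no longer matter.
my ($fixed_flag_off, $fixed_flag_on);
{
    use feature 'unicode_strings';
    $fixed_flag_off = lc($bytes) eq "\xE9" ? 1 : 0;
    $fixed_flag_on  = lc($chars) eq "\xE9" ? 1 : 0;
}

printf "strings compare equal:          %d\n", $same_string;
printf "legacy lc, flag off, lowercased: %d\n", $lc_flag_off;
printf "legacy lc, flag on,  lowercased: %d\n", $lc_flag_on;
printf "unicode_strings, flag off:       %d\n", $fixed_flag_off;
printf "unicode_strings, flag on:        %d\n", $fixed_flag_on;
```

The point of the sketch is that the two strings compare equal, yet only the flagged one is lowercased under the legacy rules; inside the unicode_strings block both are.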

Re^4: Seeking Perl docs about how UTF8 flag propagates
by raygun (Scribe) on May 16, 2023 at 02:17 UTC
    Those aspects of Perl that require you to know the state of the UTF8 flag
    But I'm not asking "what is the state of the UTF8 flag." I'm asking, "Does a given operation preserve the state?"
    $str2 = $str;
    $str3 = sprintf ("str is %s", $str);
    @words = split(/ /, $str);
    Do $str2, $str3, and $words[2] have the same flag value as $str? Does it depend on other factors? Is it undefined? (I suppose since it's intentionally undocumented, it's at least theoretically undefined.)

      I'd need to (laboriously) check the source for chapter and verse, but as far as I remember in all the obvious cases when any of the inputs have UTF8 on, the output will too.

      Here's an example commonly used in perl's tests to create a UTF8-flagged string by appending a flagged zero-length string:

      % perl -MDevel::Peek -wle '
          $x="\x{100}"; Dump($x);
          chop $x; Dump($x);
          $y = "foo"; Dump($y);
          $y .= $x; Dump($y)
        ' 2>&1 | grep FLAGS
      FLAGS = (POK,IsCOW,pPOK,UTF8)
      FLAGS = (POK,pPOK,UTF8)
      FLAGS = (POK,IsCOW,pPOK)
      FLAGS = (POK,pPOK,UTF8)

      Your examples certainly all appear to propagate the flag. However, substr() appears to propagate it only if the source string (not, as I first wrote, the resulting substring) has characters above 0x7f; I have no idea why that appears to be an exception. I also do not know of any guarantee that any of these behaviours will be retained in future perl versions (though I think it is hugely unlikely that the steps involved in the code above will change, due to its widespread use in core).
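
For anyone who would rather check this without Devel::Peek, here is a minimal sketch using utf8::is_utf8. The sample string and variable names are made up for illustration, and the results describe current behaviour only, not a documented guarantee.

```perl
use strict;
use warnings;

# Hypothetical sample input: the \x{100} forces perl to store it with UTF8 on.
my $str = "alpha \x{100} gamma";

# The three operations from the question.
my $str2  = $str;
my $str3  = sprintf("str is %s", $str);
my @words = split(/ /, $str);

# The empty-but-flagged trick from the one-liner above.
my $empty = "\x{100}";
chop $empty;                    # now zero-length, flag still on
my $joined = "foo";
$joined .= $empty;              # content unchanged, flag picked up

my @report = (
    [ 'source'    => $str      ],
    [ 'copy'      => $str2     ],
    [ 'sprintf'   => $str3     ],
    [ 'words[2]'  => $words[2] ],   # "gamma", all ASCII
    [ 'joined'    => $joined   ],   # "foo", all ASCII
);

my %flag;
for my $row (@report) {
    my ($name, $value) = @$row;
    $flag{$name} = utf8::is_utf8($value) ? 1 : 0;
    printf "%-9s UTF8 %s\n", $name, $flag{$name} ? 'on' : 'off';
}
```

Note that $words[2] and $joined contain only ASCII characters, so the flag here comes purely from the inputs to the operation rather than from the content of the result.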

      Update: from perl source, it appears substr returns a UTF8_off result if the source string has byte length and character length the same.
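
A minimal sketch of that substr behaviour, assuming a reasonably recent perl. The helper name probe and the sample strings are hypothetical, and the substring is copied into a variable first because substr passed straight to a function is treated as an lvalue and may behave differently.

```perl
use strict;
use warnings;

# probe() is a hypothetical helper: it reports the source flag, whether the
# character length equals the UTF-8 byte length, and the flag on a substring.
sub probe {
    my ($source) = @_;

    my $piece = substr($source, 0, 1);   # plain rvalue substr, copied out

    my $encoded = $source;
    utf8::encode($encoded);              # byte string of the UTF-8 encoding

    return {
        source_flag => utf8::is_utf8($source) ? 1 : 0,
        same_length => length($encoded) == length($source) ? 1 : 0,
        piece_flag  => utf8::is_utf8($piece) ? 1 : 0,
    };
}

my $ascii = "abc";
utf8::upgrade($ascii);                   # flag on, but every character is one byte

my $wide = "a\x{100}c";                  # flag on, one character needs two bytes

my $from_ascii = probe($ascii);
my $from_wide  = probe($wide);

for my $case ([ ascii => $from_ascii ], [ wide => $from_wide ]) {
    my ($name, $r) = @$case;
    printf "%-5s source flag %d, lengths equal %d, substr flag %d\n",
        $name, $r->{source_flag}, $r->{same_length}, $r->{piece_flag};
}
```

In both cases the extracted character is a plain "a"; what differs is only whether the source string has any multi-byte characters.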

        I'd need to (laboriously) check the source for chapter and verse,
        Please don't go to the effort for my sake. Do it if you're curious, or if a laborious code audit is your idea of a good time.
        but as far as I remember in all the obvious cases when any of the inputs have UTF8 on, the output will too.
        Thank you! That's very useful, and probably good enough for my current project (though I wouldn't rely on it for enterprise-level code).
        > However substr() appears to propagate it only if the resulting substring has characters above 0x7f

        What do you mean by "appears"?

        I tried the following:

        #!/usr/bin/perl
        use warnings;
        use strict;
        use Devel::Peek;

        my $s = "\N{LATIN SMALL LETTER S WITH CARON}i\N{LATIN SMALL LETTER C WITH CARON}";
        for my $i (0 .. 2) {
            my $c = substr $s, $i, 1;
            Dump($c);
        }

        Running it through  2>&1 | grep FLAGS outputs

        FLAGS = (POK,pPOK,UTF8)
        FLAGS = (POK,pPOK,UTF8)
        FLAGS = (POK,pPOK,UTF8)

        Update: Fixed the encoding of the code.
