Seeking Perl docs about how UTF8 flag propagates

by raygun (Scribe)
on May 15, 2023 at 18:55 UTC ( [id://11152194]=perlquestion )

raygun has asked for the wisdom of the Perl Monks concerning the following question:

Is there any Perl documentation that clearly specifies when the UTF8 flag is propagated from one string to another? One would expect it to propagate in a simple assignment (though I can't find a clear statement of even this in the perldocs). What about a function that takes a string and returns either another string or an array of strings? One would hope the documentation for such functions would address this, but the docs for, e.g., sprintf and split do not.

perlunicode addresses strings "from an external source marked as Unicode," but says nothing about the more common case of internal sources, and I haven't found this explained straightforwardly in several other Unicode-related perldocs I've scoured. What have I overlooked?

Replies are listed 'Best First'.
Re: Seeking Perl docs about how UTF8 flag propagates
by hv (Prior) on May 15, 2023 at 19:13 UTC

    What have I overlooked?

    The principle is that you don't need to know: it's an internal flag, and you just need to trust that strings will behave as they should. The intent is that when the UTF8 flag is turned off, this is purely an optimization that allows the internals to do various things in a simpler, faster way. So that's why the docs aren't littered with discussions of what effect each operation has on the flag.

    Do you have a specific reason for wanting to know the state of the flag in particular cases? I'm sure we can help answer questions about specifics.

      The intent is that when the UTF8 flag is turned off, this is purely an optimization that allows the internals to do various things in a simpler, faster way.
      If that's the intent, it doesn't always work in practice, when certain functions (e.g., lc) change their behavior depending on how the flag is set. But I take your point that because this is the intent (whether it works that way in practice or not), the info I'm seeking is undocumented. So I have the answer I need, thank you.

        certain functions (e.g., lc) change their behavior depending on how the flag is set.

        Yes, this is indeed the fly in the ointment. As far as I know such cases are documented - and in this case at least the documentation describes mechanisms that force it to behave one way or another independent of the UTF8 flag (eg use bytes versus use feature 'unicode_strings').

        Those aspects of Perl that require you to know the state of the UTF8 flag are collectively known as "the Unicode bug", and there is more detail in a section devoted to this in perlunicode.

        If that's the intent

        It is. Code that behaves differently based on the internal storage format is said to suffer from The Unicode Bug.

        it doesn't always work in practice

        True. Notably, the operators that accept file names. And of course, some XS modules.

        utf8::upgrade and utf8::downgrade can be used to work around these bugs.

        certain functions (e.g., lc) change their behavior depending on how the flag is set.

        lc, uc and the regex engine were fixed in 5.14, released in 2011 (12 years ago).

        To get the fix, you need to add use v5.14; to your code, or, more specifically, use feature qw( unicode_strings );.

        (The feature actually appeared in 5.12, but it didn't fix as many things in 5.12 as in 5.14, so I pretend it was added in 5.14.)
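A minimal sketch of the lc case discussed above, assuming a Perl of 5.14 or later with no locale in effect. The variable names are illustrative only. The same string is held in both storage formats and lowercased with and without the unicode_strings feature:

```perl
use strict;
use warnings;

# "\xC9" is É (U+00C9); its lowercase is é (U+00E9).
my $down = "\xC9";      # stored in the downgraded format
my $up   = $down;
utf8::upgrade($up);     # same string, upgraded format

my ($old_down, $old_up, $new_down, $new_up);
{
    no feature 'unicode_strings';     # legacy behaviour
    ($old_down, $old_up) = (lc($down), lc($up));
}
{
    use feature 'unicode_strings';    # behaviour with the fix
    ($new_down, $new_up) = (lc($down), lc($up));
}

printf "strings are equal: %s\n", $down eq $up ? 'yes' : 'no';
printf "without feature:   %02X vs %02X\n", ord($old_down), ord($old_up);   # C9 vs E9
printf "with feature:      %02X vs %02X\n", ord($new_down), ord($new_up);   # E9 vs E9
```

Without the feature, two strings that compare equal give different results from lc, which is The Unicode Bug in miniature. With the feature, the storage format no longer matters.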

        Could you please provide an example where lc behaves differently, depending on the flag?

        As far as I know lc will simply preserve the flag of the input (I am not sure whether this holds on EBCDIC platforms).

        The opposite function, uc, is known to set the flag for a (non-flagged) input of chr 0xFF or 'ÿ': Its uppercase equivalent 'Ÿ' is not present in ISO-8859-1, but taken from the Unicode block Latin Extended-A.
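The uc case can be sketched the same way. This assumes a downgraded input and no locale; as far as I can tell, the flag is only forced on when Unicode rules apply (here via unicode_strings), because U+0178 cannot be stored in the downgraded format:

```perl
use strict;
use warnings;

my $y = "\xFF";          # ÿ (U+00FF)
utf8::downgrade($y);     # make sure we start from the downgraded format

my ($legacy, $fixed);
{ no feature 'unicode_strings';  $legacy = uc($y); }
{ use feature 'unicode_strings'; $fixed  = uc($y); }

printf "without feature: U+%04X\n", ord($legacy);    # U+00FF (unchanged)
printf "with feature:    U+%04X, upgraded: %s\n",    # U+0178, upgraded: yes
    ord($fixed), utf8::is_utf8($fixed) ? 'yes' : 'no';
```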

        > when certain functions (e.g., lc) change their behavior depending on how the flag is set

        That's the point you seem to be missing.

        The function length must report different numbers of characters, if 2-4 bytes are supposed to represent a unicode entity because of the utf8-flag. Same for other functions.
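A small sketch may help separate two things here: the storage format (the UTF8 flag) and the content of the string. As I understand it, length counts characters of the string regardless of the flag; the count only changes when the string is actually encoded into bytes. Variable names are illustrative:

```perl
use strict;
use warnings;

my $e_down = "\xE9";            # é, downgraded format
my $e_up   = $e_down;
utf8::upgrade($e_up);           # é, upgraded format (two bytes internally)

my $euro  = "\x{20AC}";         # €, one character
my $bytes = $euro;
utf8::encode($bytes);           # now a different string: three 8-bit characters

printf "length of é, downgraded: %d\n", length($e_down);   # 1
printf "length of é, upgraded:   %d\n", length($e_up);     # 1
printf "length of €:             %d\n", length($euro);     # 1
printf "length of encoded €:     %d\n", length($bytes);    # 3
```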

        Otherwise please be more specific about what lc does wrongly...

        Cheers Rolf
        (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)

Re: Seeking Perl docs about how UTF8 flag propagates
by demerphq (Chancellor) on May 17, 2023 at 13:41 UTC

    My understanding has always been that the substring of a string should share its utf8'ness, and that once turned on it stays on until explicitly turned off. Anything else I consider a bug.

    Having said that, it is good form to treat it as an uncertain value, and when you need to care you should ensure the variable is in the form you want/need. utf8::upgrade() and utf8::downgrade() and the related functions in Encode are your friends here. Just remember that while utf8::upgrade() should always work, utf8::downgrade() may not be possible, and you may want to use utf8::encode() instead, depending on what you are doing.
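A hedged sketch of that last point, using a made-up sample string. It relies on the optional second argument of utf8::downgrade, which asks it to return false instead of dying when the string contains a character above 0xFF:

```perl
use strict;
use warnings;

my $text = "na\x{EF}ve \x{263A}";    # "naïve ☺": 7 characters, one above 0xFF

my $copy = $text;
utf8::upgrade($copy);                # always possible

# With a true second argument, failure is reported by the return value.
my $could_downgrade = utf8::downgrade($copy, 1);

# If what you actually need is bytes, encode a copy instead.
my $octets = $text;
utf8::encode($octets);

printf "downgrade possible: %s\n", $could_downgrade ? 'yes' : 'no';   # no
printf "characters: %d, encoded bytes: %d\n",
    length($text), length($octets);                                    # 7, 10
```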

    ---
    $world=~s/war/peace/g

      utf8::is_utf8

      Indicates which internal storage format is used by a scalar.

      USE: Debugging XS modules.

      utf8::upgrade

      Changes a scalar to use the upgraded string format (if it's not already) without changing the string.

      my $s = ...;
      my $t = $s;
      utf8::upgrade( $t );
      say utf8::is_utf8( $t ) ?1:0;   # 1
      say $s eq $t ?1:0;              # 1

      USE: Working around instances of The Unicode Bug.

      utf8::downgrade

      Changes a scalar to use the downgraded string format (if it's not already) without changing the string. Dies if it can't.

      my $s = ...;
      my $t = $s;
      utf8::downgrade( $t );          # Might croak
      say utf8::is_utf8( $t ) ?1:0;   # 0
      say $s eq $t ?1:0;              # 1

      USE: Working around instances of The Unicode Bug.

      utf8::encode

      Encodes a string using utf8.

      Expects a string of arbitrary characters in either storage format.

      Produces a string of 8-bit characters in the downgraded format.

      USE: You should probably be encoding using the standard UTF-8 encoding instead of the Perl-specific utf8 encoding.

      utf8::decode

      Decodes a string encoded using utf8. Returns false if it can't.

      Expects a string of 8-bit characters in either storage format.

      Produces a string of characters in the upgraded format.

      USE: utf8 is a Perl-specific encoding. Are you sure the text isn't encoded using the standard UTF-8 encoding?
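For comparison, a minimal sketch of the standard UTF-8 route through the core Encode module. The sample text and variable names are made up; FB_CROAK asks decode to die on malformed input, and LEAVE_SRC keeps it from modifying the source string:

```perl
use strict;
use warnings;
use Encode qw( encode decode );

my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

my $text  = "caf\x{E9} \x{20AC}";                # "café €", 6 characters
my $bytes = encode('UTF-8', $text);              # strict UTF-8, 9 bytes
my $round = decode('UTF-8', $bytes, $check);     # back to characters

# Malformed input is rejected rather than silently accepted.
my $bad = "\xFF\xFE";
my $ok  = eval { decode('UTF-8', $bad, $check); 1 };

printf "characters: %d, bytes: %d\n", length($text), length($bytes);   # 6, 9
printf "round trip intact: %s\n", $round eq $text ? 'yes' : 'no';      # yes
printf "bad input accepted: %s\n", $ok ? 'yes' : 'no';                 # no
```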

      Encode::is_utf8

      Indicates which internal storage format is used by a scalar.

      USE: You might as well use the equivalent built-in utf8::is_utf8.

      Encode::_utf8_on

      Mostly equivalent to the following:

      utf8::decode( $_ ) if !utf8::is_utf8( $_ );

      The difference is that it produces a corrupt scalar if the string isn't valid utf8.

      USE: Do not use as it introduces The Unicode Bug.

      Encode::_utf8_off

      Equivalent to the following:

      utf8::encode( $_ ) if utf8::is_utf8( $_ );

      USE: Do not use as it introduces The Unicode Bug.

      My understanding has always been that the substring of a string should share its utf8'ness, and that once turned on it stays on until explicitly turned off.

      It makes sense to me that once turned on in a particular string, it should stay on (e.g., if the string is modified via $str =~ s///). Functions that give you substrings (e.g., substr, split) create new strings for these, so there is no "staying on" to be done. The flag value would have to be intentionally propagated from one string to another.

      Even a basic assignment creates a new string, but one would hope one of the properties of an assignment operator is that it duplicates both a variable's data and its metadata. (Yet even this fairly straightforward fact is not documented in perlop.)
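For anyone who wants to see what their own perl does, here is a small observation script. It deliberately states no expected flag values, since (as discussed in this thread) they are undocumented and may differ between releases; the flag helper is just a name chosen for this sketch:

```perl
use strict;
use warnings;

# Report the storage format of a scalar; for debugging only.
sub flag { utf8::is_utf8($_[0]) ? 'on' : 'off' }

my $s = "one two";
utf8::upgrade($s);                   # force the upgraded format

my $copy  = $s;                      # plain assignment
my $part  = substr($s, 0, 3);        # rvalue substr
my @words = split / /, $s;           # split into a list

print "original: ", flag($s),    "\n";
print "assigned: ", flag($copy), "\n";
print "substr:   ", flag($part), "\n";
print "split:    ", join(', ', map { flag($_) } @words), "\n";

# What is guaranteed is the content, whatever the flags say.
print "content intact\n"
    if $copy eq 'one two' && $part eq 'one' && "@words" eq 'one two';
```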

      Anything else I consider a bug.
      By that logic, the behavior hv points out in Re^7: Seeking Perl docs about how UTF8 flag propagates is a bug. But since no documentation supports your expectation, I'm not sure you could make a case for that.
      Having said that, it is good form to treat it as an uncertain value
      Yeah, that's what I'm doing now, in response to this thread. Thanks to everyone who's chimed in.
Re: Seeking Perl docs about how UTF8 flag propagates
by hv (Prior) on May 19, 2023 at 17:36 UTC

    I started a thread on perl5-porters about this, SvUTF8 predictability, and it also led to a fair bit of discussion. However I think this snippet from dave_the_m probably best sums up the likely consensus:

    I don't think it's reasonable to document perl's behaviour vis-a-vis UTF8 flag behaviour. It will vary between releases, and it may well vary between different code paths (for example hypothetically rvalue and lvalue substr() might differ). It would also constrict any future bug fixes or optimisations.

    ... with an assumption that mechanisms such as the 'unicode_strings' feature will continue to be added and refined to reduce as far as possible any need to know.

Re: Seeking Perl docs about how UTF8 flag propagates
by ikegami (Patriarch) on May 17, 2023 at 15:18 UTC

    Is there any Perl documentation that clearly specifies when the UTF8 flag is propagated from one string to another?

    The UTF8 flag indicates which of two internal storage formats is being used.

    There's no documentation on the choice of internal storage format, because it's an internal detail. Perl is free to use the string storage format of its choice.

    If you need a specific storage format (i.e. to work around an instance of The Unicode Bug), then you can use utf8::upgrade or utf8::downgrade to ensure a specific storage format is used.


    what about a function that takes a string and returns either another string or an array of strings?

    Neither operators nor subs can return arrays, just scalars.

    "Functions" refers to named operators in Perl. Operators will tend to output strings in the same storage format as their operands, but there's no requirement for this. Mixing strings of different formats usually results in an upgraded string, as this format supports all strings.
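A minimal sketch of the mixing case, with illustrative variable names. The one outcome that can be relied on is that a result containing a character above 0xFF must be in the upgraded format, since the downgraded format cannot hold it; the content of both operands is preserved either way:

```perl
use strict;
use warnings;

my $down = "\xE9";         # é, downgraded format
my $up   = "\x{263A}";     # ☺, can only exist in the upgraded format

my $mixed = $down . $up;   # concatenate strings of different formats

printf "length: %d\n", length($mixed);                       # 2
printf "chars:  U+%04X U+%04X\n",
    ord(substr($mixed, 0, 1)), ord(substr($mixed, 1, 1));    # U+00E9 U+263A
printf "result upgraded: %s\n",
    utf8::is_utf8($mixed) ? 'yes' : 'no';                    # yes
```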

    Returning strings from a sub should have no effect on their storage format.

      There's no documentation on the choice of internal storage format, because it's an internal detail.
      ...except when it affects the documented behavior of functions like lc in certain situations.
      Neither operators nor subs can return arrays, just scalars.
      I don't follow your meaning here. In what sense is this not an array being returned:
      @words = split(/ /, 'This is a sentence.');
      Returning strings from a sub should have no effect on their storage format.
      ...yet sometimes does, per Re^7: Seeking Perl docs about how UTF8 flag propagates.

        ...except when it affects the documented behavior of functions like lc in certain situations.

        As I previously mentioned, lc is (intentionally) buggy (for backwards compatibility) when not using the unicode_strings feature. So yes, it makes sense to document it.

        In what sense is this not an array being returned:

        In every sense. Four scalars are returned by split, which are then assigned to an existing array.

        ...yet sometimes does

        No. split is a function (named operator), not a sub.

        But while I won't rule out the possibility of a change for subs, it's definitely not possible for a function. "Returning" scalars from a function definitely has no effect on their internal storage format. That post shows no evidence that returning a value had any effect on it whatsoever. The returned scalars are as the function created them.

        I don't follow your meaning here. In what sense is this not an array being returned:

        raygun, in case you haven't run across this idea yet, Perl has a distinction between a LIST and an ARRAY: split returns a LIST, not an ARRAY.

        Per List value constructors, "List values are denoted by separating individual values by commas (and enclosing the list in parentheses where precedence requires it)", so a LIST is just the comma-separated sequence (which, as explained lower down in that same section, will interpolate any lists, arrays, or hashes that are included in the list).

        Per perldata's DESCRIPTION, an ARRAY is a datatype that contains "ordered lists of scalars indexed by number". The term can also be used to refer to any variable or anonymous data that has the ARRAY datatype.

        So in @words = split(/ /, 'This is a sentence.'); , the @words variable is an ARRAY variable, and it is being initialized by a LIST of scalars returned by the built-in function split.
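A short sketch of the distinction, reusing the sentence from the question. The list returned by split can be assigned to an array, assigned to scalars, or sliced directly, and only the first of these involves an array at all:

```perl
use strict;
use warnings;

my $sentence = 'This is a sentence.';

# split returns a LIST of scalars; the assignment copies them into the array.
my @words = split / /, $sentence;

# The same list can be assigned to scalars or sliced, with no array involved.
my ($first, $second) = split / /, $sentence;
my $last = (split / /, $sentence)[-1];

printf "%d words; first=%s second=%s last=%s\n",
    scalar(@words), $first, $second, $last;
# prints: 4 words; first=This second=is last=sentence.
```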


        edit: created separated paragraphs for clarity
