PerlMonks  

Re: Seeking Perl docs about how UTF8 flag propagates

by ikegami (Patriarch)
on May 17, 2023 at 15:18 UTC


in reply to Seeking Perl docs about how UTF8 flag propagates

Is there any Perl documentation that clearly specifies when the UTF8 flag is propagated from one string to another?

The UTF8 flag indicates which of two internal storage formats is being used.

There's no documentation on the choice of internal storage format, because it's an internal detail. Perl is free to use the string storage format of its choice.

If you need a specific storage format (e.g. to work around an instance of The Unicode Bug), then you can use utf8::upgrade or utf8::downgrade to ensure that format is used.
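A minimal sketch of what that looks like in practice. The variable names are only illustrative, and utf8::is_utf8 is used purely to peek at the flag; nothing here should be read as a guarantee about which format Perl picks on its own.

```perl
use strict;
use warnings;

# "\xE9" (e with acute accent) can be stored in either internal format.
my $s = "\xE9";
utf8::downgrade($s);                      # force the downgraded format
my $before = utf8::is_utf8($s) ? 1 : 0;

my $copy = $s;
utf8::upgrade($copy);                     # changes the storage format only
my $after = utf8::is_utf8($copy) ? 1 : 0;

# The string value is the same regardless of storage format.
my $same = ($s eq $copy) ? 1 : 0;

print "before=$before after=$after same=$same\n";   # before=0 after=1 same=1
```

The two scalars compare equal even though their flags differ, which is why the flag is only of interest when working around a bug.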


what about a function that takes a string and returns either another string or an array of strings?

Neither operators nor subs can return arrays, just scalars.

In Perl, "function" refers to a named operator. Operators tend to output strings in the same storage format as their operands, but there's no requirement for this. Mixing strings of different formats usually results in an upgraded string, as that format supports all strings.
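To illustrate the mixing case, here is a small sketch using concatenation. It shows what current perls do, not something the documentation promises; the variable names are made up for the example.

```perl
use strict;
use warnings;

my $down = "\xE9";
utf8::downgrade($down);        # downgraded format

my $up = "\x{263A}";           # a character above 255 needs the upgraded format

my $joined = $down . $up;      # mixing the two formats

# Inspect the flag of each scalar (1 = upgraded, 0 = downgraded).
my @flags = map { utf8::is_utf8($_) ? 1 : 0 } $down, $up, $joined;
my $len   = length($joined);

print "@flags\n";              # 0 1 1
print "$len\n";                # 2
```

The operand $down is left as it was; only the result ends up in the upgraded format, and its length is still counted in characters.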

Returning strings from a sub should have no effect on their storage format.

Replies are listed 'Best First'.
Re^2: Seeking Perl docs about how UTF8 flag propagates
by raygun (Scribe) on May 17, 2023 at 21:39 UTC
    There's no documentation on the choice of internal storage format, because it's an internal detail.
    ...except when it affects the documented behavior of functions like lc in certain situations.
    Neither operators nor subs can return arrays, just scalars.
    I don't follow your meaning here. In what sense is this not an array being returned:
    @words = split(/ /, 'This is a sentence.');
    Returning strings from a sub should have no effect on their storage format.
    ...yet sometimes does, per Re^7: Seeking Perl docs about how UTF8 flag propagates.

      ...except when it affects the documented behavior of functions like lc in certain situations.

      As I previously mentioned, lc is (intentionally) buggy (for backwards compatibility) when not using the unicode_strings feature. So yes, it makes sense to document it.
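A sketch of the lc behavior in question, assuming Perl 5.12 or later (where the unicode_strings feature exists) and no locale pragma in effect. The variable names are illustrative only.

```perl
use strict;
use warnings;

# U+00C0 LATIN CAPITAL LETTER A WITH GRAVE, stored both ways.
my $down = "\xC0";
utf8::downgrade($down);
my $up = $down;
utf8::upgrade($up);

my ($legacy_same, $fixed_same);
{
    # Without the feature, lc treats the downgraded string with
    # ASCII-only rules, so the two results differ.
    $legacy_same = (lc($down) eq lc($up)) ? 1 : 0;
}
{
    use feature 'unicode_strings';
    # With the feature, the storage format no longer matters.
    $fixed_same = (lc($down) eq lc($up)) ? 1 : 0;
}

print "legacy_same=$legacy_same fixed_same=$fixed_same\n";
# legacy_same=0 fixed_same=1
```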

      In what sense is this not an array being returned:

      In every sense. Four scalars are returned by split, which are then assigned to an existing array.
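A short sketch of the same point: the scalars split returns can be assigned to an array or straight into individual scalar variables. The variable names are just for the example.

```perl
use strict;
use warnings;

# The returned scalars are assigned to an existing array...
my @words = split(/ /, 'This is a sentence.');

# ...or directly to a list of scalar variables, with no array involved.
my ($w1, $w2, $w3, $w4) = split(/ /, 'This is a sentence.');

my $count = scalar(@words);
print "$count $w1 $w4\n";      # 4 This sentence.
```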

      ...yet sometimes does

      No. split is a function (named operator), not a sub.

      But while I won't rule out the possibility of a change for subs, it's definitely not possible for a function. "Returning" scalars from a function definitely has no effect on their internal storage format. That post shows no evidence that returning a value had any effect on it whatsoever. The returned scalars are as the function created them.

      I don't follow your meaning here. In what sense is this not an array being returned:

      raygun, in case you haven't run across this idea yet, Perl has a distinction between a LIST and an ARRAY: split returns a LIST, not an ARRAY.

      Per List value constructors, "List values are denoted by separating individual values by commas (and enclosing the list in parentheses where precedence requires it)", so a LIST is just the comma-separated sequence (which, as explained lower down in that same section, will interpolate any lists, arrays, or hashes that are included in the list).

      Per perldata's DESCRIPTION, an ARRAY is a datatype that contains "ordered lists of scalars indexed by number". The term can also be used to refer to any variable or anonymous data that has the ARRAY datatype.

      So in @words = split(/ /, 'This is a sentence.'); , the @words variable is an ARRAY variable, and it is being initialized by a LIST of scalars returned by the built-in function split.


      edit: created separate paragraphs for clarity
        Perl has a distinction between a LIST and an ARRAY
        Thanks; I've seen both terms, and assumed them to be basically interchangeable—which in a lot of contexts I reckon they are. But I take your point that if split returned an array, you couldn't do something like ($word1, $word2, $word3, $word4) = split(/ /, 'This is a sentence.'). So thanks for the clarification.

        split returns a LIST, not an ARRAY.

        It does not return a "LIST". There's no such data structure. As I said, it returns scalars, which is to say it adds scalars to the stack.

        Colloquially, we do say it returns a list (of scalars). By that, we simply mean it returns (a number of) scalars.

        Scalars are the only things being returned. No list. No array.

        (As for "LIST" spelled like that, the docs use this to refer to an expression evaluated in list context, such as the arguments to print. split most definitely does not return an expression.)
