Seeking Perl docs about how UTF8 flag propagates

raygun has asked for the wisdom of the Perl Monks concerning the following question:


Re: Seeking Perl docs about how UTF8 flag propagates
by hv (Prior) on May 15, 2023 at 19:13 UTC

What have I overlooked?

The principle is that you don't need to know: it's an internal flag, and you just need to trust that strings will behave as they should. The intent is that when the UTF8 flag is turned off, this is purely an optimization that allows the internals to do various things in a simpler, faster way. So that's why the docs aren't littered with discussions of what effect each operation has on the flag.

Do you have a specific reason for wanting to know the state of the flag in particular cases? I'm sure we can help answer questions about specifics.


Re^2: Seeking Perl docs about how UTF8 flag propagates

by raygun (Scribe) on May 15, 2023 at 19:37 UTC

The intent is that when the UTF8 flag is turned off, this is purely an optimization that allows the internals to do various things in a simpler, faster way.


Re^3: Seeking Perl docs about how UTF8 flag propagates

by hv (Prior) on May 15, 2023 at 22:10 UTC

certain functions (e.g., lc) change their behavior depending on how the flag is set.

Yes, this is indeed the fly in the ointment. As far as I know such cases are documented, and in this case at least the documentation describes mechanisms that force it to behave one way or another independent of the UTF8 flag (e.g. use bytes versus use feature 'unicode_strings').

Those aspects of Perl that require you to know the state of the UTF8 flag are collectively known as "the Unicode bug", and there is more detail in a section devoted to this in perlunicode.


Re^4: Seeking Perl docs about how UTF8 flag propagates

by raygun (Scribe) on May 16, 2023 at 02:17 UTC

Re^5: Seeking Perl docs about how UTF8 flag propagates

by hv (Prior) on May 16, 2023 at 02:50 UTC


Re^3: Seeking Perl docs about how UTF8 flag propagates

by ikegami (Patriarch) on May 17, 2023 at 16:37 UTC

If that's the intent

It is. Code that behaves differently based on the internal storage format is said to suffer from The Unicode Bug.

it doesn't always work in practice

True. Notably, the operators that accept file names. And of course, some XS modules.

utf8::upgrade and utf8::downgrade can be used to work around these bugs.

certain functions (e.g., lc) change their behavior depending on how the flag is set.

lc, uc and the regex engine were fixed in 5.14, released in 2011 (12 years ago).

To get the fix, you need use v5.14; or, more specifically, use feature qw( unicode_strings );.

(The feature actually appeared in 5.12, but it didn't fix as many things in 5.12 as in 5.14, so I pretend it was added in 5.14.)
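A minimal sketch of the difference described above, assuming a perl of 5.14 or later and no locale pragma in effect. The variable names are illustrative only. It lowercases the same one-character string (U+00C9) held in both storage formats, first without and then with the unicode_strings feature.

```perl
use strict;
use warnings;
no feature 'unicode_strings';    # the default when no "use v5.12"+ is in effect

# The same one-character string, U+00C9, in both storage formats.
my $down = "\xC9";
utf8::downgrade($down);          # make sure this copy is downgraded
my $up = $down;
utf8::upgrade($up);              # same string, upgraded format

my $same_string = $down eq $up ? 1 : 0;

# Without the feature, lc uses ASCII-only rules for the downgraded copy
# but Unicode rules for the upgraded one.
my $legacy_agree = lc($down) eq lc($up) ? 1 : 0;

my $fixed_agree;
{
    use feature 'unicode_strings';
    # With the feature, lc uses Unicode rules for both copies.
    $fixed_agree = lc($down) eq lc($up) ? 1 : 0;
}

print "same string:          $same_string\n";
print "lc agrees (legacy):   $legacy_agree\n";
print "lc agrees (with fix): $fixed_agree\n";
```

The two inputs compare equal, yet only inside the block with the feature enabled do their lowercased forms agree.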


Re^3: Seeking Perl docs about how UTF8 flag propagates

by haj (Vicar) on May 16, 2023 at 22:22 UTC

Could you please provide an example where lc behaves differently, depending on the flag?

As far as I know lc will simply preserve the flag of the input (I am not sure whether this holds on EBCDIC platforms).

The opposite function, uc, is known to set the flag for a (non-flagged) input of chr 0xFF or 'ÿ': Its uppercase equivalent 'Ÿ' is not present in ISO-8859-1, but taken from the Unicode block Latin Extended-A.
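A small sketch of the uc case, assuming the unicode_strings feature is enabled so that uc applies Unicode rules to the downgraded input. The variable names are illustrative only.

```perl
use strict;
use warnings;
use feature 'unicode_strings';

my $lower = chr 0xFF;            # U+00FF, fits in a single byte
utf8::downgrade($lower);         # make sure the input is not flagged

my $upper = uc $lower;           # U+0178, which does not fit in a byte

my $flag_before = utf8::is_utf8($lower) ? 1 : 0;
my $flag_after  = utf8::is_utf8($upper) ? 1 : 0;

printf "uc gives U+%04X\n", ord $upper;
print "flag on input: $flag_before, flag on result: $flag_after\n";
```

The result has to be stored in the upgraded format here, because its only character is above 0xFF.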


Re^4: Seeking Perl docs about how UTF8 flag propagates

by hv (Prior) on May 17, 2023 at 00:22 UTC

Re^5: Seeking Perl docs about how UTF8 flag propagates

by hippo (Bishop) on May 17, 2023 at 06:46 UTC


Re^5: Seeking Perl docs about how UTF8 flag propagates

by haj (Vicar) on May 17, 2023 at 06:34 UTC

Re^3: Seeking Perl docs about how UTF8 flag propagates

by LanX (Saint) on May 15, 2023 at 21:49 UTC

> when certain functions (e.g., lc) change their behavior depending on how the flag is set

That's the point you seem to be missing.

The function length must report different numbers of characters, if 2-4 bytes are supposed to represent a unicode entity because of the utf8-flag. Same for other functions.

Otherwise please be more specific about what lc does wrongly...

Cheers Rolf
(addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
Wikisyntax for the Monastery


Re^4: Seeking Perl docs about how UTF8 flag propagates

by hv (Prior) on May 16, 2023 at 02:19 UTC

Re^5: Seeking Perl docs about how UTF8 flag propagates

by LanX (Saint) on May 16, 2023 at 09:48 UTC


Re^4: Seeking Perl docs about how UTF8 flag propagates

by raygun (Scribe) on May 16, 2023 at 01:51 UTC

Re^5: Seeking Perl docs about how UTF8 flag propagates

by LanX (Saint) on May 16, 2023 at 09:56 UTC


Re: Seeking Perl docs about how UTF8 flag propagates
by demerphq (Chancellor) on May 17, 2023 at 13:41 UTC

My understanding has always been that the substring of a string should share its utf8'ness, and that once turned on it stays on until explicitly turned off. Anything else I consider a bug.

Having said that, it is good form to treat it as an uncertain value, and when you need to care you should ensure the variable is in the form you want/need. utf8::upgrade() and utf8::downgrade() and the related functions in Encode are your friends here. Just remember that while utf8::upgrade() should always work, utf8::downgrade() may not be possible and you may want to use utf8::encode() instead, depending on what you are doing.
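As a rough illustration of that last point (a sketch only, with made-up variable names): utf8::downgrade takes an optional second argument that makes it return false instead of dying when the string cannot be downgraded, and utf8::encode then gives a byte string instead.

```perl
use strict;
use warnings;

# A string whose characters all fit in a byte can go either way.
my $latin = "caf\xE9";
utf8::upgrade($latin);                              # always possible
my $latin_ok = utf8::downgrade($latin, 1) ? 1 : 0;  # succeeds

# A string containing a character above 0xFF cannot be downgraded.
my $wide = "\x{263A}";
my $copy = $wide;
my $wide_ok = utf8::downgrade($copy, 1) ? 1 : 0;    # fails, but does not die

# utf8::encode replaces the characters with their encoded bytes in place.
my $bytes = $wide;
utf8::encode($bytes);

print "downgrade of latin string ok: $latin_ok\n";
print "downgrade of wide string ok:  $wide_ok\n";
print "encoded length in bytes:      ", length($bytes), "\n";
```

Note that utf8::encode changes the string (one character becomes three bytes here), whereas upgrade and downgrade change only the storage.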

---
$world=~s/war/peace/g


Re^2: Seeking Perl docs about how UTF8 flag propagates

by ikegami (Patriarch) on May 17, 2023 at 19:18 UTC

utf8::is_utf8
    Indicates which internal storage format is used by a scalar.
    USE: Debugging XS modules.

utf8::upgrade
    Changes a scalar to use the upgraded string format (if it's not already) without changing the string.

        my $s = ...;
        my $t = $s;
        utf8::upgrade( $t );
        say utf8::is_utf8( $t ) ?1:0;   # 1
        say $s eq $t ?1:0;              # 1

    USE: Working around instances of The Unicode Bug.

utf8::downgrade
    Changes a scalar to use the downgraded string format (if it's not already) without changing the string. Dies if it can't.

        my $s = ...;
        my $t = $s;
        utf8::downgrade( $t );          # Might croak
        say utf8::is_utf8( $t ) ?1:0;   # 0
        say $s eq $t ?1:0;              # 1

    USE: Working around instances of The Unicode Bug.

utf8::encode
    Encodes a string using utf8. Expects a string of arbitrary characters in either storage format. Produces a string of 8-bit characters in the downgraded format.
    USE: You should probably be encoding using the standard UTF-8 encoding instead of the Perl-specific utf8 encoding.

utf8::decode
    Decodes a string encoded using utf8. Returns false if it can't. Expects a string of 8-bit characters in either storage format. Produces a string of characters in the upgraded format.
    USE: utf8 is a Perl-specific encoding. Are you sure the text isn't encoded using the standard UTF-8 encoding?

Encode::is_utf8
    Indicates which internal storage format is used by a scalar.
    USE: You might as well use the equivalent built-in utf8::is_utf8.

Encode::_utf8_on
    Mostly equivalent to the following:

        utf8::decode( $_ ) if !utf8::is_utf8( $_ );

    The difference is that it produces a corrupt scalar if the string isn't valid utf8.
    USE: Do not use as it introduces The Unicode Bug.

Encode::_utf8_off
    Equivalent to the following:

        utf8::encode( $_ ) if utf8::is_utf8( $_ );

    USE: Do not use as it introduces The Unicode Bug.
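For the "standard UTF-8 encoding" recommended above, one option is the core Encode module. This is a minimal sketch rather than the only way to do it; the sample text and variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Two characters: U+00E9 and U+263A.
my $text = "\x{E9}\x{263A}";

# Characters to bytes, using the strict 'UTF-8' encoding.
my $bytes = encode('UTF-8', $text);

# Bytes back to characters. FB_CROAK makes malformed input fatal and
# LEAVE_SRC keeps $bytes from being modified during decoding.
my $round_trip = decode('UTF-8', $bytes,
    Encode::FB_CROAK() | Encode::LEAVE_SRC());

printf "%d characters, %d bytes\n", length($text), length($bytes);
print "round trip ", ($round_trip eq $text ? "ok" : "failed"), "\n";
```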


Re^2: Seeking Perl docs about how UTF8 flag propagates

by raygun (Scribe) on May 17, 2023 at 21:15 UTC

My understanding has always been that the substring of a string should share its utf8'ness, and that once turned on it stays on until explicitly turned off.

It makes sense to me that once turned on in a particular string, it should stay on (e.g., if the string is modified via $str =~ s///). Functions that give you substrings (e.g., substr, split) create new strings for these, so there is no "staying on" to be done. The flag value would have to be intentionally propagated from one string to another.

Even a basic assignment creates a new string, but one would hope one of the properties of an assignment operator is that it duplicates both a variable's data and its metadata. (Yet even this fairly straightforward fact is not documented in perlop.)

Anything else I consider a bug.

Re^7: Seeking Perl docs about how UTF8 flag propagates

Having said that, it is good form to treat it as an uncertain value


Re: Seeking Perl docs about how UTF8 flag propagates
by hv (Prior) on May 19, 2023 at 17:36 UTC

I started a thread on perl5-porters about this, SvUTF8 predictability, and it also led to a fair bit of discussion. However I think this snippet from dave_the_m probably best sums up the likely consensus:

I don't think it's reasonable to document perl's behaviour vis-a-vis UTF8 flag behaviour. It will vary between releases, and it may well vary between different code paths (for example hypothetically rvalue and lvalue substr() might differ). It would also constrict any future bug fixes or optimisations.

.. with an assumption that mechanisms such as the 'unicode_strings' feature will continue to be added and refined to reduce as far as possible any need to know.


Re: Seeking Perl docs about how UTF8 flag propagates
by ikegami (Patriarch) on May 17, 2023 at 15:18 UTC

Is there any Perl documentation that clearly specifies when the UTF8 flag is propagated from one string to another?

The UTF8 flag indicates which of two internal storage formats is being used.

There's no documentation on the choice of internal storage format, because it's an internal detail. Perl is free to use the string storage format of its choice.

If you need a specific storage format (i.e. to work around an instance of The Unicode Bug), then you can use utf8::upgrade or utf8::downgrade to ensure a specific storage format is used.

what about a function that takes a string and returns either another string or an array of strings?

Neither operators nor subs can return arrays, just scalars.

"Functions" refers to named operators in Perl. Operators tend to output strings in the same storage format as their operands, but there's no requirement for this. Mixing strings of different formats usually results in an upgraded string, as this format supports all strings.

Returning strings from a sub should have no effect on their storage format.
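A short sketch of the mixing case, using concatenation as the example operator and illustrative variable names. The wide character is chosen deliberately so that the result can only be stored upgraded; for other inputs the choice of format is up to perl.

```perl
use strict;
use warnings;

my $narrow = "\xE9";
utf8::downgrade($narrow);        # make sure this operand is downgraded

my $wide = "\x{263A}";           # cannot be stored in the downgraded format

my $joined = $narrow . $wide;    # mixes the two storage formats

my $narrow_flag = utf8::is_utf8($narrow) ? 1 : 0;
my $joined_flag = utf8::is_utf8($joined) ? 1 : 0;

print "narrow operand flagged: $narrow_flag\n";
print "joined result flagged:  $joined_flag\n";
printf "joined has %d characters, first is U+%04X\n",
    length($joined), ord($joined);
```

The first character of the result is still U+00E9: the storage changed, the string did not.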


Re^2: Seeking Perl docs about how UTF8 flag propagates

by raygun (Scribe) on May 17, 2023 at 21:39 UTC

There's no documentation on the choice of internal storage format, because it's an internal detail.

Neither operators nor subs can return arrays, just scalars.

@words = split(/ /, 'This is a sentence.');

Returning strings from a sub should have no effect on their storage format.

Re^7: Seeking Perl docs about how UTF8 flag propagates


Re^3: Seeking Perl docs about how UTF8 flag propagates

by ikegami (Patriarch) on May 17, 2023 at 22:29 UTC

...except when it affects the documented behavior of functions like lc in certain situations.

As I previously mentioned, lc is (intentionally) buggy (for backwards compatibility) when not using the unicode_strings feature. So yes, it makes sense to document it.

In what sense is this not an array being returned:

In every sense. Four scalars are returned by split, which are then assigned to an existing array.

..yet sometimes does

No. split is a function (named operator), not a sub.

But while I won't rule out the possibility of a change for subs, it's definitely not possible for a function. "Returning" scalars from a function definitely has no effect on their internal storage format. That post shows no evidence that returning a value had any effect on it whatsoever. The returned scalars are as the function created them.


Re^3: Seeking Perl docs about how UTF8 flag propagates

by pryrt (Abbot) on May 18, 2023 at 17:43 UTC

I don't follow your meaning here. In what sense is this not an array being returned:

raygun, in case you haven't run across this idea yet, Perl has a distinction between a LIST and an ARRAY: split returns a LIST, not an ARRAY.

Per List value constructors, "List values are denoted by separating individual values by commas (and enclosing the list in parentheses where precedence requires it)", so a LIST is just the comma-separated sequence (which, as explained lower down in that same section, will interpolate any lists, arrays, or hashes that are included in the list).

Per perldata's DESCRIPTION, an ARRAY is a datatype that contains "ordered lists of scalars indexed by number". The term can also be used to refer to any variable or anonymous data that has the ARRAY datatype.

So in @words = split(/ /, 'This is a sentence.'); , the @words variable is an ARRAY variable, and it is being initialized by a LIST of scalars returned by the built-in function split.
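The distinction can be sketched with the same split call (the variable names other than @words are illustrative only):

```perl
use strict;
use warnings;

my $sentence = 'This is a sentence.';

# split returns a LIST of scalars; the assignment copies them into an ARRAY.
my @words = split / /, $sentence;

# An ARRAY in scalar context gives its element count.
my $count = @words;

# A LIST can be sliced directly, without ever being stored in an array.
my $last = (split / /, $sentence)[-1];

# A LIST can also be unpacked into a mix of scalars and an array.
my ($first, @rest) = split / /, $sentence;

print "count: $count, first: $first, last: $last\n";
```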



Re^4: Seeking Perl docs about how UTF8 flag propagates

by raygun (Scribe) on May 19, 2023 at 04:51 UTC

Re^5: Seeking Perl docs about how UTF8 flag propagates - lists and arrays

by Discipulus (Canon) on May 19, 2023 at 10:26 UTC

Drowning(Distraction) in(via) nomenclature

by parv (Parson) on May 19, 2023 at 09:24 UTC


Re^4: Seeking Perl docs about how UTF8 flag propagates

by ikegami (Patriarch) on May 18, 2023 at 18:51 UTC

Re^5: Seeking Perl docs about how UTF8 flag propagates

by pryrt (Abbot) on May 18, 2023 at 20:29 UTC


