But you created $d and $u using "internal" Perl functions

We could debate that, but it's completely irrelevant. I used them because it made the example clear. But I could have used ordinary string literals to get the same behaviour.

can this bug be reproduced using "supported" operations?

utf8::upgrade and utf8::downgrade are fully supported. But yes.

because you're not supposed to mix codepoint-strings with byte-strings

One, there was no "mixing", so that's also completely irrelevant.

Two, that's completely untrue. "Mixing" strings with different internal storage is not only acceptable, it's common.

use utf8;
my $u = "Éric";
my $d = "Brine";
my $s = "$u $d";   # Perfectly ok!
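
A minimal sketch of the same point, with the storage formats forced explicitly so the difference is visible. It uses an `\x{C9}` escape instead of `use utf8` so it doesn't depend on how the file is saved; that substitution and the printed labels are illustrative choices, not part of the example above.

```perl
use strict;
use warnings;

my $u = "\x{C9}ric";     # "Éric", written with an escape
my $d = "Brine";         # pure-ASCII literal
utf8::upgrade($u);       # force the upgraded storage format
utf8::downgrade($d);     # force the downgraded storage format

my $s = "$u $d";         # concatenating the two formats is fine

printf "u upgraded: %s, d upgraded: %s\n",
    (utf8::is_utf8($u) ? "yes" : "no"),
    (utf8::is_utf8($d) ? "yes" : "no");
printf "length: %d, equals expected string: %s\n",
    length($s),
    ($s eq "\x{C9}ric Brine" ? "yes" : "no");
```

The storage format differs between `$u` and `$d`, but the resulting string has the expected characters either way.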

Re^7: how are ARGV and filename strings represented?
by almr (Beadle) on May 05, 2024 at 17:20 UTC
    We could debate that

    I don't know how it came across, but I'm not trying to debate you, of all, uh... monks. I'm just trying to find a sub-dialect of Perl in which the Unico-debacle doesn't happen.

    One, there was no "mixing", so that's also completely irrelevant.

    The mixing was not in your upgrade / downgrade examples, but in my previous sentence: concatenating a decoded codepoint-string (the directory) with a byte-string (the result of glob). One object "you're not supposed to" pass to open().

    Two, that's completely untrue. "Mixing" strings with different internal storage is not only acceptable, it's common

    So, OK, you've reminded me that path fragments could come not only from ARGV, or from a list of files read from a handle, but also from the program source, so the nightmare deepens.

    Scenario 1: I have a $dirname from decoded ARGV (so it's a codepoint-string, marked as upgraded), and I "File->new($dirname . q(/readme.txt), q(>))".

    Scenario 2: Like (1), but I "File->new($dirname . q(/) . $author . q(.txt), q(>))", where $author is "Saint-Saëns" also obtained from a decoded ARGV, or read from a handle with UTF-8 perlio.

    Scenario 3: like (2), but I provide "Saint-Saëns" in the program source: "$author = qq(Saint-Sa\x{00eb}ns)".

    Scenario 4: like (3), but I "use utf8; $author = qq(Saint-Saëns);"

    Scenarios 5 and 6: like (3), but $dirname now also comes from program source, "$dirname = q(mydir)"

    Scenarios 1-4 would be ok, because at least one of the components is an upgraded codepoint-string.

    Scenario 5, OTOH, fails: all of the path components are "downgraded" strings, so the concatenated path is too, and none of its codepoints is above 255. So open() doesn't know it needs to encode() before passing the string to libc.
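
    To make the state of the scenario 5 string concrete, here is a small sketch that only inspects the path and never creates a file. The variable names $stored_hex and $utf8_hex are illustrative. It shows what the string holds, not what open() actually hands to libc.

    ```perl
    use strict;
    use warnings;

    # Scenario 5: every path component comes from the source, no `use utf8`.
    my $dirname = q(mydir);
    my $author  = qq(Saint-Sa\x{00eb}ns);
    my $path    = $dirname . q(/) . $author . q(.txt);

    # No component is upgraded, so the concatenation is not upgraded either.
    my $upgraded = utf8::is_utf8($path) ? "yes" : "no";

    # The characters of the path, one byte per character (U+00EB shows as "eb").
    my $stored_hex = unpack "H*", $path;

    # The same text explicitly encoded as UTF-8 (U+00EB becomes "c3ab").
    my $utf8_bytes = $path;
    utf8::encode($utf8_bytes);
    my $utf8_hex = unpack "H*", $utf8_bytes;

    print "upgraded: $upgraded\n";
    print "stored:   $stored_hex\n";
    print "as UTF-8: $utf8_hex\n";
    ```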

    Scenario 6 seems to work, probably because non-ASCII string literals defined in the program source are stored as UTF-8 on disk. If the program source comes from "-e" typed in the shell, I can't figure out what happens (it probably depends on the shell and locale).

    I'm not sure what to do about this. Maybe call upgrade() (or decode()?) on any non-ASCII path component defined in the source code.
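
    A rough sketch of what that could look like, again without touching the filesystem. Option A is the upgrade() idea above. Option B, encoding the whole path explicitly, is an assumption on my part, not something established in this thread: path_bytes() is a hypothetical helper and it assumes the filesystem expects UTF-8 names.

    ```perl
    use strict;
    use warnings;
    use Encode qw(encode);

    my $dirname = q(mydir);
    my $author  = qq(Saint-Sa\x{00eb}ns);   # downgraded literal, as in scenario 5

    # Option A: upgrade the non-ASCII component defined in the source.
    utf8::upgrade($author);
    my $path = $dirname . q(/) . $author . q(.txt);
    print "path upgraded: ", (utf8::is_utf8($path) ? "yes" : "no"), "\n";

    # Option B (assumption): encode explicitly, so the resulting bytes do not
    # depend on the string's storage format. Hypothetical helper.
    sub path_bytes {
        my ($str) = @_;
        return encode('UTF-8', $str);
    }

    # A downgraded copy of the same text encodes to the same bytes.
    my $down = $path;
    utf8::downgrade($down);
    print "same bytes either way: ",
        (path_bytes($path) eq path_bytes($down) ? "yes" : "no"), "\n";
    ```

    With option B, the bytes would be computed before the call, e.g. passing path_bytes($path) wherever the path is opened, instead of relying on how the string happens to be stored.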