in reply to Re^6: how are ARGV and filename strings represented?
in thread how are ARGV and filename strings represented?

We could debate that

I don't know how it came across, but I'm not trying to debate you, of all, uh... monks. I'm just trying to find a sub-dialect of Perl in which the Unico-debacle doesn't happen.

One, there was no "mixing", so that's also completely irrelevant.

The mixing was not in your upgrade / downgrade examples, but in my previous sentence: concatenating a decoded codepoint-string (the directory) with a byte-string (the result of glob). The result is an object "you're not supposed to" pass to open().
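To make that kind of mixing concrete, here is a minimal sketch. The two inputs are made-up stand-ins (a directory name as it might look after decoding ARGV, and a file name as glob might return it in raw UTF-8 bytes), and I'm assuming a UTF-8 filesystem:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical inputs, not taken from the thread.
my $dir_chars  = decode('UTF-8', "caf\xc3\xa9");   # decoded: 4 characters
my $file_bytes = "na\xc3\xafve.txt";               # undecoded: 10 raw bytes

# Concatenation treats each high byte of $file_bytes as a Latin-1 code point.
my $mixed = $dir_chars . '/' . $file_bytes;

# Encoding the result as UTF-8 therefore double-encodes the file name part.
my $double_encoded = encode('UTF-8', $mixed);

printf "characters in mixed path: %d\n", length $mixed;
printf "bytes after encoding:     %d\n", length $double_encoded;
```

The directory part comes out right, but the file name part no longer matches the bytes glob returned, so the resulting path points at a file that doesn't exist.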

Two, that's completely untrue. "Mixing" strings with different internal storage is not only acceptable, it's common

So, OK, you've reminded me that path fragments could come not only from ARGV, or from a list of files read from a handle, but also from the program source, so the nightmare deepens.

Scenario 1: I have a $dirname from decoded ARGV (so it's a codepoint-string, marked as upgraded), and I "File->new($dirname . q(/readme.txt), q(>))".

Scenario 2: Like (1), but I "File->new($dirname . q(/) . $author . q(.txt), q(>))", where $author is "Saint-Saëns" also obtained from a decoded ARGV, or read from a handle with UTF-8 perlio.

Scenario 3: like (2), but I provide "Saint-Saëns" in the program source: "$author = qq(Saint-Sa\x{00eb}ns)".

Scenario 4: like (3), but I "use utf8; $author = qq(Saint-Saëns);"

Scenarios 5 and 6: like (3) and (4) respectively, but $dirname now also comes from the program source: "$dirname = q(mydir)"
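A rough sketch of how the different sources of $author end up stored, assuming this script itself is saved as UTF-8. The decode() call only stands in for "decoded ARGV"; utf8::is_utf8() reports whether a string is upgraded:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Scenario 2 stand-in: raw bytes as they might arrive, then decoded.
my $from_argv = decode('UTF-8', "Saint-Sa\xc3\xabns");

# Scenario 3: escape sequence in the source, code point below 256.
my $from_escape = qq(Saint-Sa\x{00eb}ns);

# Scenario 4: literal character, with "use utf8" in effect for this block.
my $from_literal = do { use utf8; qq(Saint-Saëns) };

for my $pair ([argv => $from_argv], [escape => $from_escape], [literal => $from_literal]) {
    my ($label, $string) = @$pair;
    printf "%-8s length=%d upgraded=%s\n",
        $label, length $string, utf8::is_utf8($string) ? 'yes' : 'no';
}
```

All three compare equal with eq and have the same length in characters; only the internal storage differs.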

Scenarios 1-4 would be ok, because at least one of the components is an upgraded codepoint-string, and so the concatenated path is upgraded too.

5, OTOH, fails, because all of the path components are "downgraded" strings, and so the concatenated path also is. Also none of the codepoints are above 255. So open() doesn't know it needs to encode() before passing the string to libc.
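Here is a small sketch of what scenario 5 looks like from the inside, on the assumption (which this whole thread rests on) that open() hands the string's internal buffer to libc unchanged. bytes::length() is used only to peek at the buffer size:

```perl
use strict;
use warnings;
use bytes ();   # loaded only for bytes::length(), an inspection aid

my $dirname = q(mydir);
my $author  = qq(Saint-Sa\x{00eb}ns);
my $path    = $dirname . q(/) . $author . q(.txt);

# Same characters, but stored upgraded (internally UTF-8).
my $upgraded = $path;
utf8::upgrade($upgraded);

printf "as built: upgraded=%s, %d chars, %d internal bytes\n",
    utf8::is_utf8($path) ? 'yes' : 'no', length $path, bytes::length($path);
printf "upgraded: upgraded=%s, %d chars, %d internal bytes\n",
    utf8::is_utf8($upgraded) ? 'yes' : 'no', length $upgraded, bytes::length($upgraded);
```

The two strings are equal as far as eq is concerned, but the first one holds a single 0xEB byte for the "ë" (Latin-1), while the second holds the two-byte UTF-8 sequence a UTF-8 filesystem would expect.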

6 seems to work, probably because with "use utf8" a non-ASCII string literal in the program source (stored as UTF-8 on disk) is decoded into an upgraded string. If the program source comes from "-e" typed in the shell, I can't figure out what happens (it probably depends on the shell / locale).

I'm not sure what to do about this. Maybe call upgrade() (or decode()?) on any non-ASCII path component defined in the source code.
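For what it's worth, a sketch of two ways this could go, neither of which I'd call settled. Option A is the utf8::upgrade() idea from the previous paragraph. Option B uses a hypothetical helper, path_bytes() (my own name, not from any module), that encodes the finished path explicitly, assuming the filesystem wants UTF-8:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $dirname = q(mydir);
my $author  = qq(Saint-Sa\x{00eb}ns);   # downgraded as written

# Option A: upgrade the non-ASCII component before building the path.
utf8::upgrade($author);
my $path = $dirname . q(/) . $author . q(.txt);
print utf8::is_utf8($path) ? "path is upgraded\n" : "path is downgraded\n";

# Option B: hypothetical helper that always encodes characters to UTF-8 bytes,
# so the internal storage of the input no longer matters.
sub path_bytes {
    my ($chars) = @_;
    return encode('UTF-8', $chars);
}

my $bytes = path_bytes($path);
printf "%d characters, %d bytes to hand to open()\n", length $path, length $bytes;
```

Option B gives the same bytes whether or not the input was upgraded, but it only works if every path component is a codepoint-string to begin with, which brings back the mixing problem for anything that comes out of glob or readdir undecoded.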
