in reply to Re^4: how are ARGV and filename strings represented?
in thread how are ARGV and filename strings represented?

$d eq $u is true in the snippet I provided earlier, so open should do the same for both. But it doesn't. That's a bug.

Yes. But you created $d and $u using "internal" Perl functions (AFAICT the programmer is not supposed to invoke upgrade() / downgrade() directly?). Now, can this bug be reproduced using "supported" operations?

You can reproduce it by concatenating the result of readlink() or glob() to a codepoint-string, but that in itself means breaking the conventions (because you're not supposed to mix codepoint-strings with byte-strings).

So now I'm starting to lean to the position that one should decode ARGV, decode STDIN, decode readlink() and glob output, and thus always work with codepoint-strings (which can be safely concatenated, trimmed etc and then passed to open(), because open() "calls" transform(), which detects that it needs to encode them).

Re^6: how are ARGV and filename strings represented?
by ikegami (Patriarch) on May 03, 2024 at 18:32 UTC

    So now I'm starting to lean to the position that one should decode ARGV

    Yes, but that means you might not be able to access/create some files.

    Files in unix systems are arbitrary sequences of bytes, so the file name might not be decodable. Imagine two processes/users/machines using different locales accessing the same volume. That said, the de facto standardization towards UTF-8 makes problems an exception rather than the rule.
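
A minimal sketch of the failure mode described above, assuming file names are expected to be UTF-8. `try_decode_name` is a hypothetical helper, not part of any module. Strict decoding with Encode's FB_CROAK check dies on a name that was written under a different encoding, so the helper returns undef for it:

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: returns the decoded name, or undef if the bytes
# are not valid UTF-8. LEAVE_SRC keeps decode() from modifying its input.
sub try_decode_name {
    my ($bytes) = @_;
    my $name = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $name;
}

my $utf8_name   = "\xC3\xA9t\xC3\xA9.txt";  # "été.txt" encoded as UTF-8
my $latin1_name = "\xE9t\xE9.txt";          # "été.txt" encoded as Latin-1

print defined(try_decode_name($utf8_name))   ? "decodable\n" : "not decodable\n";
print defined(try_decode_name($latin1_name)) ? "decodable\n" : "not decodable\n";
```

A program that decodes every name up front has to decide what to do with the second kind of file: skip it, report it, or keep the raw bytes around.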

    Files in Windows are encoded using UTF-16le, but Perl uses a different encoding to talk to the OS. For example, Perl uses Windows-1252 on my machine, so using @ARGV will invariably limit me to files that can be encoded using Windows-1252.

    In Windows, you can use CommandLineToArgvW instead of @ARGV, plus Win32::LongPath to work with any path, and you can decode them without issue.

    In unix, you can work with any path by making sure to provide downgraded strings, but there's no simple solution to decoding them without loss.
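
One way to read "provide downgraded strings", as a rough sketch: keep the path as bytes and force the internal storage to the downgraded form just before handing it to open(). `downgraded_path` is a hypothetical helper name; utf8::downgrade with a true second argument returns false instead of dying when the string holds characters above 255.

```perl
use strict;
use warnings;

# Hypothetical helper: return a downgraded copy of a path held as bytes.
sub downgraded_path {
    my ($path) = @_;    # copy, so the caller's string is untouched
    utf8::downgrade($path, 1)
        or die("Path contains characters above 255; encode it first\n");
    return $path;
}

my $bytes    = "caf\xC3\xA9.txt";   # UTF-8 bytes, e.g. as returned by readdir
my $upgraded = $bytes;
utf8::upgrade($upgraded);           # same string, different internal storage

my $safe = downgraded_path($upgraded);
printf "%s %s\n",
    utf8::is_utf8($upgraded) ? "upgraded" : "downgraded",
    utf8::is_utf8($safe)     ? "upgraded" : "downgraded";
```

The helper changes only the storage format, never the string's value, so `$safe eq $bytes` still holds.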


    One way of accessing CommandLineToArgvW (from here):

    use strict;
    use warnings;
    use feature qw( say state );

    use open ':std', ':encoding('.do { require Win32; "cp".Win32::GetConsoleOutputCP() }.')';

    use Config     qw( %Config );
    use Encode     qw( decode encode );
    use Win32::API qw( ReadMemory );

    use constant PTR_SIZE => $Config{ptrsize};

    use constant PTR_PACK_FORMAT =>
         PTR_SIZE == 8 ? 'Q'
       : PTR_SIZE == 4 ? 'L'
       : die("Unrecognized ptrsize\n");

    use constant PTR_WIN32API_TYPE =>
         PTR_SIZE == 8 ? 'Q'
       : PTR_SIZE == 4 ? 'N'
       : die("Unrecognized ptrsize\n");

    sub lstrlenW {
       my ($ptr) = @_;

       state $lstrlenW = Win32::API->new('kernel32', 'lstrlenW', PTR_WIN32API_TYPE, 'i')
          or die($^E);

       return $lstrlenW->Call($ptr);
    }

    sub decode_LPCWSTR {
       my ($ptr) = @_;
       return undef if !$ptr;

       my $num_chars = lstrlenW($ptr)
          or return '';

       return decode('UTF-16le', ReadMemory($ptr, $num_chars * 2));
    }

    # Returns true on success. Returns false and sets $^E on error.
    sub LocalFree {
       my ($ptr) = @_;

       state $LocalFree = Win32::API->new('kernel32', 'LocalFree', PTR_WIN32API_TYPE, PTR_WIN32API_TYPE)
          or die($^E);

       return $LocalFree->Call($ptr) == 0;
    }

    sub GetCommandLine {
       state $GetCommandLine = Win32::API->new('kernel32', 'GetCommandLineW', '', PTR_WIN32API_TYPE)
          or die($^E);

       return decode_LPCWSTR($GetCommandLine->Call());
    }

    # Returns a reference to an array on success. Returns undef and sets $^E on error.
    sub CommandLineToArgv {
       my ($cmd_line) = @_;

       state $CommandLineToArgv = Win32::API->new('shell32', 'CommandLineToArgvW', 'PP', PTR_WIN32API_TYPE)
          or die($^E);

       my $cmd_line_encoded = encode('UTF-16le', $cmd_line."\0");
       my $num_args_buf = pack('i', 0);  # Allocate space for an "int".

       my $arg_ptrs_ptr = $CommandLineToArgv->Call($cmd_line_encoded, $num_args_buf)
          or return undef;

       my $num_args = unpack('i', $num_args_buf);
       my @args =
          map { decode_LPCWSTR($_) }
             unpack PTR_PACK_FORMAT.'*',
                ReadMemory($arg_ptrs_ptr, PTR_SIZE * $num_args);

       LocalFree($arg_ptrs_ptr);
       return \@args;
    }

    {
       my $cmd_line = GetCommandLine();

       say $cmd_line;

       my $args = CommandLineToArgv($cmd_line)
          or die("CommandLineToArgv: $^E\n");

       for my $arg (@$args) {
          say "<$arg>";
       }
    }

Re^6: how are ARGV and filename strings represented?
by ikegami (Patriarch) on May 03, 2024 at 18:10 UTC

    But you created $d and $u using "internal" Perl functions

    We could debate that, but it's completely irrelevant. I used them because it made the example clear. But I could have used ordinary string literals to get the same behaviour.

    can this bug be reproduced using "supported" operations?

    utf8::upgrade and utf8::downgrade are fully supported. But yes.

    because you're not supposed to mix codepoint-strings with byte-strings

    One, there was no "mixing", so that's also completely irrelevant.

    Two, that's completely untrue. "Mixing" strings with different internal storage is not only acceptable, it's common.

    use utf8;

    my $u = "Éric";
    my $d = "Brine";
    my $s = "$u $d";  # Perfectly ok!

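
A small sketch that makes the same point observable without depending on how the source file is encoded: here the upgraded string is built with utf8::upgrade instead of `use utf8`, which is an assumption of this example rather than what the snippet above does. Concatenation upgrades the result as needed, and the string's value is unaffected by the storage format.

```perl
use strict;
use warnings;

my $u = "\xC9ric";          # "Éric"
utf8::upgrade($u);          # force the upgraded internal storage
my $d = "Brine";            # plain ASCII literal, stored downgraded
my $s = "$u $d";            # concatenating the two is fine

my $all_downgraded = "\xC9ric Brine";   # same text, never upgraded

printf "s is %s, reference is %s, equal: %s, length: %d\n",
    utf8::is_utf8($s)              ? "upgraded" : "downgraded",
    utf8::is_utf8($all_downgraded) ? "upgraded" : "downgraded",
    $s eq $all_downgraded          ? "yes"      : "no",
    length($s);
```

Both strings hold the same 10 characters; only the internal storage differs, which is why well-behaved code should treat them identically.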
      We could debate that

      I don't know how it came across, but I'm not trying to debate you, of all, uh... monks. I'm just trying to find a sub-dialect of Perl in which the Unico-debacle doesn't happen.

      One, there was no "mixing", so that's also completely irrelevant.

      The mixing was not in your upgrade / downgrade examples, but in my previous sentence: concatenating a decoded codepoint-string (the directory) with a byte-string (the result of glob). One object "you're not supposed to" pass to open().

      Two, that's completely untrue. "Mixing" strings with different internal storage is not only acceptable, it's common

      So, OK, you've reminded me that path fragments could come not only from ARGV, or from a list of files read from a handle, but also from the program source, so the nightmare deepens.

      Scenario 1: I have a $dirname from decoded ARGV (so it's a codepoint-string, marked as upgraded), and I "File->new($dirname . q(/readme.txt), q(>))".

      Scenario 2: Like (1), but I "File->new($dirname . q(/) . $author . q(.txt), q(>))", where $author is "Saint-Saëns" also obtained from a decoded ARGV, or read from a handle with UTF-8 perlio.

      Scenario 3: like (2), but I provide "Saint-Saëns" in the program source: "$author = qq(Saint-Sa\x{00eb}ns)".

      Scenario 4: like (3), but I "use utf8; $author = qq(Saint-Saëns);"

      Scenarios 5 and 6: like (3) and (4) respectively, but $dirname now also comes from program source, "$dirname = q(mydir)"

      Scenarios 1-4 would be ok, because at least one of the components is an upgraded codepoint-string.

      5, OTOH, fails, because all of the path components are "downgraded" strings, and so the concatenated path also is. Also none of the codepoints are above 255. So open() doesn't know it needs to encode() before passing the string to libc.

      6 seems to work, probably because non-ASCII string literals defined in the program source are stored as utf-8 on disk. If the program source comes from "-e" typed in the shell, I can't figure out what happens (probably depends on shell / locale)

      I'm not sure what to do about this. Maybe call upgrade() (or decode()?) on any non-ASCII path component defined in the source code.
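
A rough sketch of scenario 5 and of one alternative to calling upgrade(): encoding the finished path explicitly before it reaches open(). This assumes the file system expects UTF-8 names and swaps in Encode's encode() for the upgrade() idea above. Unlike the internal storage flag, the encoded bytes are the same whether the path happens to be upgraded or downgraded.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Scenario 5: every component comes from the program source.
my $dirname = "mydir";
my $author  = "Saint-Sa\x{00eb}ns";
my $path    = $dirname . "/" . $author . ".txt";

# The same path, but with upgraded internal storage.
my $upgraded = $path;
utf8::upgrade($upgraded);

# Explicit encoding looks only at the characters, not at the storage.
my $bytes_from_downgraded = encode('UTF-8', $path);
my $bytes_from_upgraded   = encode('UTF-8', $upgraded);

printf "path is %s, copy is %s, encoded bytes equal: %s\n",
    utf8::is_utf8($path)     ? "upgraded" : "downgraded",
    utf8::is_utf8($upgraded) ? "upgraded" : "downgraded",
    $bytes_from_downgraded eq $bytes_from_upgraded ? "yes" : "no";
```

The encoded result could then be passed to open() as a byte string. Whether that suits a given program depends on the locale assumptions discussed earlier in the thread.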