Re^5: how are ARGV and filename strings represented?
by NERDVANA (Priest) on May 02, 2024 at 16:48 UTC
If I'm writing C, and do dumb things with pointers like:
void function2(float *point);

void function1() {
    float x, y, z;
    function2(&x);
}

void function2(float *point) {
    point[2] = 5;  /* writes past the single float at &x */
}
That is undefined behavior. It doesn't generate a warning, and on some hosts it will work and on other hosts it will not work (or even crash), depending on how the compiler laid out the local variables. It isn't a "bug" in the C language, but it's certainly a footgun.
My understanding of Perl's filesystem rules is that the programmer is responsible for performing encoding on every string prior to passing it to the OS, using their knowledge of the OS. If you skip that encoding step (as most of us do, because there's no easy universal facility to tell us which encoding the host is using), then you run the risk of undefined behavior, which may manifest differently depending on internal details of the string.
I think it would be very nice to have a facility to know how to encode filesystem strings, and a library that performs that job so that doing it right is easier than not. I wouldn't consider that library to be working around a bug, but rather supplying a missing feature of the language.
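For illustration, here is roughly what that discipline looks like today; a minimal sketch that assumes the locale's codeset is what the filesystem expects, which is exactly the guess such a facility would make reliable:
use Encode qw( encode );
use I18N::Langinfo qw( langinfo CODESET );  # POSIX systems

my $fs_enc = langinfo(CODESET);             # e.g. "UTF-8" -- a guess, not a guarantee
my $name   = "caf\N{LATIN SMALL LETTER E WITH ACUTE}.txt";
open my $fh, '>', encode($fs_enc, $name) or die "open: $!";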
I think it would be very nice to have a facility to know how to encode filesystem strings, and a library that performs that job so that doing it right is easier than not
The problem is that at least some filesystems just don't care about encoding:
- NTFS is easy: it uses UTF-16 to encode names, and forbids only a few codepoints, depending on the operating system (POSIX forbids / and NUL, as usual; Windows forbids /\:*"?<>| and NUL).
- The Apple File System (APFS), found on modern Mac systems, encodes names in UTF-8. Its predecessor HFS Plus also enforces Unicode names (stored on disk as UTF-16, in decomposed form).
- FAT uses an 8-bit encoding that depends on the operating system: CP437 on old DOS machines, CP850 on newer DOS machines in the "western" parts of the world, or some completely different encoding in other parts of the world and on non-DOS machines (e.g. Atari). FAT does NOT store any information about which encoding is used. Excluded characters depend on the OS; additionally, byte 0xE5 is used to mark deleted files. On DOS machines, names are converted to upper case. Yes, FAT smells funny, but because it is used for the EFI system partition, it won't go away any time soon.
- FAT extended with Long filename support à la Microsoft (often called "VFAT") uses UCS-2 for the "long" filenames.
- exFAT (which does not look like classic FAT at all) uses UTF-16 to encode filenames. Microsoft forbids U+0000 to U+001F, and /\:*?"<>| in filenames.
- ext2, ext3, and ext4, like FAT, use an 8-bit encoding depending on the operating system, and forbid only NUL and / in filenames. Many Linux distributions assume that filenames are encoded in UTF-8, but that's completely in user space, not in the filesystem.
- The same is true for btrfs.
- And for xfs.
- And for zfs.
- And for ufs.
- And for ReiserFS.
There are many more filesystems, but I think these are still commonly in use on personal computers and PC-based servers.
As you can see, only a few filesystems use some variant of Unicode encoding for the filenames. The others use just bytes, with some "forbidden" values (exFAT), in some encoding that cannot be derived from the filesystem. Userspace may decide to use UTF-8 or some legacy encoding on those filesystems.
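There is no reliable way to detect the intended encoding from a raw name; at best you can probe whether a name happens to be valid UTF-8. A sketch of that heuristic (note that being valid UTF-8 does not prove the user meant UTF-8):
use Encode ();

sub looks_like_utf8 {
    my ($bytes) = @_;
    return eval {
        Encode::decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    } ? 1 : 0;
}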
As long as we stay in user space (as opposed to kernel space), we don't have to care much about the encoding. Converting the encoding is the job of the operating system. Legacy filesystems have their encoding set via mount options or the like.
Systems like Linux and the *BSDs just don't care about encoding; they treat filenames as null-terminated strings of bytes, just like Unix has done since 1970. Modern Windows can treat filenames as Unicode when using the "wide API" (function names ending in a capital W), where all strings are UTF-16 (originally UCS-2). When using the "ANSI API" (function names ending in a capital A), strings use some legacy 8-bit encoding based on ASCII; the actual encoding depends on user and regional settings. Plan 9 from Bell Labs used UTF-8 for the entire API. (I don't know how Apple handles filenames in their APIs.)
So, how should your proposed library handle file names?
On Windows, it should use the "wide API", period. Everything is Unicode (UTF-16/UCS-2). Same for Plan 9, but using UTF-8 encoding.
Linux and the BSDs? Everything is a null-terminated collection of bytes, maybe UTF-8, maybe some legacy encoding, and maybe also depending on where in the filesystem you are. How should your library handle that?
Oh, and let's not forget that Android is technically Linux, with a lot of custom user-space on top.
Mac OS X? I don't know the APIs, but it inherited a lot from Unix. So it probably looks like Unix.
I completely omitted the other channels: command line parameters, the environment, and I/O via STDIN/STDOUT/STDERR. Command line and environment are more or less just an API thing, like filenames. I/O is completely different. It lacks any information about the encoding and is treated as a byte stream on Unix. Windows treats it as a character stream with an unspecified 8-bit encoding, converting line endings as needed, but not characters.
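To make the difference concrete, here is a sketch of how a script might pick an output layer per platform (a heuristic under the stated assumptions, not a universal rule):
if ($^O eq 'MSWin32') {
    require Win32;
    # The console code page is a user/regional setting, e.g. 850 or 1252.
    binmode STDOUT, ':encoding(cp' . Win32::GetConsoleOutputCP() . ')';
}
else {
    # Unix streams are plain bytes; this assumes a UTF-8 locale.
    binmode STDOUT, ':encoding(UTF-8)';
}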
See also Re^7: any use of 'use locale'? (source encoding).
Alexander
--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
use Unicode::Normalize qw{ NFKD NFKC };
say $^O;
my $letter = "\N{LATIN SMALL LETTER A WITH ACUTE}";
# Create the file under the decomposed name, then try to remove it
# under the composed name first and the decomposed name second.
open my $out, '>', NFKD($letter) or die "Open: $!";
unlink NFKC($letter) or warn "Warn unlink: $!";
unlink NFKD($letter) or die "Unlink: $!";
Running it on Linux and Mac gives the following different outputs:
linux
Warn unlink: No such file or directory at ./script.pl line 13.
versus
darwin
Unlink: No such file or directory at ./script.pl line 14.
The same happens when you exchange NFKC and NFKD. Yes, on a Mac, normalization happens on top of UTF-8.
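One practical consequence: a script that needs to find a file again on such a system should compare normalized forms rather than raw names. A sketch, assuming the names decode as UTF-8:
use Unicode::Normalize qw( NFC );
use Encode qw( decode );

sub same_name {
    my ($n1, $n2) = @_;   # byte strings as returned by readdir
    return NFC(decode('UTF-8', $n1)) eq NFC(decode('UTF-8', $n2));
}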
map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
So, how should your proposed library handle file names?
I like the idea of mapping invalid bytes to other characters (e.g. surrogates like Python's surrogateescape, codepoints beyond U+10FFFF, etc.).
This provides a way of accepting and generating any file name, while still treating file names as decodable text.
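A sketch of that idea in Perl (fsdecode/fsencode are illustrative names, not an existing API), using codepoints just beyond U+10FFFF as the escape range:
use Encode qw( encode decode );

use constant ESC_BASE => 0x110000;   # first codepoint beyond Unicode

# bytes -> text: each byte that breaks UTF-8 becomes a distinct
# codepoint above U+10FFFF, so no information is lost.
sub fsdecode {
    my ($bytes) = @_;
    return decode('UTF-8', $bytes, sub { chr(ESC_BASE + shift) });
}

# text -> bytes: escape codepoints turn back into their original bytes;
# everything else is encoded as UTF-8.
sub fsencode {
    my ($text) = @_;
    return join '', map {
        ord($_) >= ESC_BASE ? chr(ord($_) - ESC_BASE) : encode('UTF-8', $_)
    } split //, $text;
}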
Things get even more interesting if you try to use filenames containing invalid UTF-8 sequences on various filesystems:
Invalid-UTF8 vs the filesystem by Kristian Köhntopp. In summary:
- XFS on Linux and ext4 on Linux don't care at all. Filenames are just bytes.
- ZFS on Linux refuses filenames containing invalid UTF-8 sequences.
- APFS on MacOS Ventura also refuses filenames containing invalid UTF-8 sequences.
Python does not like tar archives with invalid UTF-8 sequences.
And a little ugly detail: apparently there is a function sys.getfilesystemencoding() that takes no parameters. Python seems to assume that all filesystems have the same encoding and that it is not path dependent.
This is at least conceptually similar to my pet problem with File::Spec, which assumes uniform behaviour across various mounted filesystems.
Alexander
--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
My understanding of Perl's filesystem rules are that the programmer is responsible for performing encoding on every string prior to passing it to the OS
Yes, but open can transform that properly-encoded text into garbage. That's a bug.
If you skip that encoding step
I didn't say anything about skipping the encoding step. This has nothing to do with anything I said.
I'll continue anyway, but it's all a straw man.
It isn't a "bug" in the C language
Of course not, because the C language doesn't define the behaviour of this (which is well understood as allowing it to have any behaviour).
However, Perl does define the behaviour of open. It should create a file with the provided name. Provide a properly-encoded string consisting of bytes C3 A9 2E 74 78 74, and it should create a file with the name consisting of the bytes C3 A9 2E 74 78 74. It doesn't always do that, and that's a bug.
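In code (a sketch; which names actually result depends on the platform and internal representation, which is exactly the point):
my $name = "\xC3\xA9.txt";            # properly encoded UTF-8 for "é.txt"
open my $fh, '>', $name or die $!;    # creates C3 A9 2E 74 78 74

utf8::upgrade($name);                 # same text, different internal storage
open my $fh2, '>', $name or die $!;   # may create C3 83 C2 A9 2E 74 78 74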
Your example is a false parallel.
Re^5: how are ARGV and filename strings represented?
by almr (Beadle) on May 02, 2024 at 18:50 UTC
$d eq $u is true in the snippet I provided earlier, so open should do the same for both. But it doesn't. That's a bug.
Yes. But you created $d and $u using "internal" Perl functions (AFAICT the programmer is not supposed to invoke upgrade() / downgrade() directly?). Now, can this bug be reproduced using "supported" operations?
You can reproduce it by concatenating the result of readlink() or glob to a codepoint-string, but that in itself means breaking the conventions (because you're not supposed to mix codepoint-strings with byte-strings).
So now I'm starting to lean to the position that one should decode ARGV, decode STDIN, decode readlink() and glob output, and thus always work with codepoint-strings (which can be safely concatenated, trimmed etc and then passed to open(), because open() "calls" transform(), which detects that it needs to encode them).
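A minimal sketch of that discipline (assuming the locale's codeset on a POSIX system; whether every possible file name survives the decode is exactly the open question):
use Encode qw( decode );
use I18N::Langinfo qw( langinfo CODESET );

my $enc = langinfo(CODESET);                    # e.g. "UTF-8"
@ARGV = map { decode($enc, $_) } @ARGV;         # codepoint-strings from here on
binmode STDIN, ":encoding($enc)";
my @names = map { decode($enc, $_) } glob '*';  # glob returns byte-strings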
So now I'm starting to lean to the position that one should decode ARGV
Yes, but that means you might not be able to access/create some files.
File names in unix systems are arbitrary sequences of bytes, so a name might not be decodable. Imagine two processes/users/machines using different locales accessing the same volume. That said, the de facto standardization towards UTF-8 makes problems an exception rather than the rule.
File names in Windows are encoded using UTF-16le, but Perl uses a different encoding to talk to the OS. For example, Perl uses Windows-1252 on my machine, so using @ARGV will invariably limit me to files whose names can be encoded using Windows-1252.
In Windows, you can use CommandLineToArgvW instead of @ARGV, plus Win32::LongPath to work with any path, and you can decode them without issue.
In unix, you can work with any path by making sure to provide downgraded strings, but there's no simple solution for decoding them without loss.
One way of accessing CommandLineToArgvW (from here):
use strict;
use warnings;
use feature qw( say state );

use open ':std', ':encoding('.do { require Win32; "cp".Win32::GetConsoleOutputCP() }.')';

use Config     qw( %Config );
use Encode     qw( decode encode );
use Win32::API qw( ReadMemory );

use constant PTR_SIZE => $Config{ptrsize};

use constant PTR_PACK_FORMAT =>
     PTR_SIZE == 8 ? 'Q'
   : PTR_SIZE == 4 ? 'L'
   : die("Unrecognized ptrsize\n");

use constant PTR_WIN32API_TYPE =>
     PTR_SIZE == 8 ? 'Q'
   : PTR_SIZE == 4 ? 'N'
   : die("Unrecognized ptrsize\n");

sub lstrlenW {
   my ($ptr) = @_;
   state $lstrlenW = Win32::API->new('kernel32', 'lstrlenW', PTR_WIN32API_TYPE, 'i')
      or die($^E);
   return $lstrlenW->Call($ptr);
}

sub decode_LPCWSTR {
   my ($ptr) = @_;
   return undef if !$ptr;
   my $num_chars = lstrlenW($ptr)
      or return '';
   return decode('UTF-16le', ReadMemory($ptr, $num_chars * 2));
}

# Returns true on success. Returns false and sets $^E on error.
sub LocalFree {
   my ($ptr) = @_;
   state $LocalFree = Win32::API->new('kernel32', 'LocalFree', PTR_WIN32API_TYPE, PTR_WIN32API_TYPE)
      or die($^E);
   return $LocalFree->Call($ptr) == 0;
}

sub GetCommandLine {
   state $GetCommandLine = Win32::API->new('kernel32', 'GetCommandLineW', '', PTR_WIN32API_TYPE)
      or die($^E);
   return decode_LPCWSTR($GetCommandLine->Call());
}

# Returns a reference to an array on success. Returns undef and sets $^E on error.
sub CommandLineToArgv {
   my ($cmd_line) = @_;
   state $CommandLineToArgv = Win32::API->new('shell32', 'CommandLineToArgvW', 'PP', PTR_WIN32API_TYPE)
      or die($^E);
   my $cmd_line_encoded = encode('UTF-16le', $cmd_line."\0");
   my $num_args_buf = pack('i', 0);  # Allocate space for an "int".
   my $arg_ptrs_ptr = $CommandLineToArgv->Call($cmd_line_encoded, $num_args_buf)
      or return undef;
   my $num_args = unpack('i', $num_args_buf);
   my @args =
      map { decode_LPCWSTR($_) }
         unpack PTR_PACK_FORMAT.'*',
            ReadMemory($arg_ptrs_ptr, PTR_SIZE * $num_args);
   LocalFree($arg_ptrs_ptr);
   return \@args;
}

{
   my $cmd_line = GetCommandLine();
   say $cmd_line;
   my $args = CommandLineToArgv($cmd_line)
      or die("CommandLineToArgv: $^E\n");
   for my $arg (@$args) {
      say "<$arg>";
   }
}
But you created $d and $u using "internal" Perl functions
We could debate that, but it's completely irrelevant. I used them because it made the example clear. But I could have used ordinary string literals to get the same behaviour.
can this bug be reproduced using "supported" operations?
utf8::upgrade and utf8::downgrade are fully supported. But yes.
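For example, this sketch uses only ordinary string literals (the chop trick forces the upgraded representation) and shows the same open() discrepancy:
my $d = "\xE9.txt";                  # stored as the bytes E9 2E 74 78 74
my $u = "\xE9.txt\x{100}"; chop $u;  # same text, stored upgraded (UTF-8)
print $d eq $u ? "equal\n" : "not equal\n";   # always "equal"
open my $f1, '>', $d or die $!;      # may create E9 2E 74 78 74
open my $f2, '>', $u or die $!;      # may create C3 A9 2E 74 78 74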
because you're not supposed to mix codepoint-strings with byte-strings
One, there was no "mixing", so that's also completely irrelevant.
Two, that's completely untrue. "Mixing" strings with different internal storage is not only acceptable, it's common.
use utf8;
my $u = "Éric";
my $d = "Brine";
my $s = "$u $d"; # Perfectly ok!
We could debate that
I don't know how it came across, but I'm not trying to debate you, of all, uh... monks. I'm just trying to find a sub-dialect of Perl in which the Unico-debacle doesn't happen.
One, there was no "mixing", so that's also completely irrelevant.
The mixing was not in your upgrade / downgrade examples, but in my previous sentence: concatenating a decoded codepoint-string (the directory) with a byte-string (the result of glob). That concatenation yields exactly the kind of object you're "not supposed to" pass to open().
Two, that's completely untrue. "Mixing" strings with different internal storage is not only acceptable, it's common
So, OK, you've reminded me that path fragments could come not only from ARGV, or from a list of files read from a handle, but also from the program source, so the nightmare deepens.
Scenario 1: I have a $dirname from decoded ARGV (so it's a codepoint-string, marked as upgraded), and I "File->new($dirname . q(/readme.txt), q(>))".
Scenario 2: Like (1), but I "File->new($dirname . q(/) . $author . q(.txt), q(>))", where $author is "Saint-Saëns" also obtained from a decoded ARGV, or read from a handle with UTF-8 perlio.
Scenario 3: like (2), but I provide "Saint-Saëns" in the program source: "$author = qq(Saint-Sa\x{00eb}ns)".
Scenario 4: like (3), but I "use utf8; $author = qq(Saint-Saëns);"
Scenarios 5 and 6: like (3) and (4) respectively, but $dirname now also comes from program source, "$dirname = q(mydir)"
Scenarios 1-4 would be ok, because at least one of the components is an upgraded codepoint-string.
5, OTOH, fails, because all of the path components are "downgraded" strings, and so the concatenated path also is. Also none of the codepoints are above 255. So open() doesn't know it needs to encode() before passing the string to libc.
6 seems to work, probably because non-ASCII string literals defined in the program source are stored as UTF-8 on disk. If the program source comes from "-e" typed in the shell, I can't figure out what happens (it probably depends on shell / locale).
I'm not sure what to do about this. Maybe call upgrade() or (decode()?) on any non-ASCII path component defined in the source code.
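A sketch of that workaround: encode the assembled path explicitly (assuming the filesystem expects UTF-8) instead of relying on how Perl happens to store it internally:
use Encode qw( encode );

my $dirname = q(mydir);
my $author  = qq(Saint-Sa\x{00eb}ns);   # scenario 5: all codepoints <= 255

my $path = encode('UTF-8', qq($dirname/$author.txt));
open my $fh, '>', $path or die "Open: $!";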