Re^4: how are ARGV and filename strings represented?
by ikegami (Patriarch) on May 02, 2024 at 01:35 UTC
It's not a "bug" per-se
It is. And that's the official name for code that behaves differently depending on the internal representation of a string.
just unspecified behavior that people constantly run into.
The issue is not that the behaviour is unspecified. A lot of Perl behaviour is unspecified. That's not really an issue because there's only one Perl interpreter. The interpreter is the language, so to speak.
The issue is that two equal strings produce different results. That is most definitely a bug.
$d eq $u is true in the snippet I provided earlier, so open should do the same for both. But it doesn't. That's a bug.
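(A reconstruction of that snippet, for readers without the parent node handy; the variable names and the file name are my assumptions, not necessarily the original:)

# Reconstruction, not the original snippet.
my $d = my $u = "\xC3\xA9.txt";    # the UTF-8 bytes of "é.txt"
utf8::downgrade($d);               # same characters, byte-oriented storage
utf8::upgrade($u);                 # same characters, UTF-8 storage
print $d eq $u ? "equal\n" : "not equal\n";    # prints "equal"
open(my $fh1, '>', $d) or die($!); # creates a name from bytes C3 A9 2E 74 78 74
open(my $fh2, '>', $u) or die($!); # on affected builds, may create a different name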
there is no one-size-fits-all solution for character encoding,
Not so. In this area, it is most definitely possible to provide an interface that works on all systems.
until such time as Strawberry releases a perl.exe with the UTF-8 codepage configured.
I did file a ticket requesting this some time ago.
If I'm writing C, and do dumb things with pointers like:
void function2(float *point);   /* forward declaration so this compiles */

void function1() {
    float x, y, z;
    function2(&x);              /* passes a pointer to a single float */
}

void function2(float *point) {
    point[2] = 5;               /* writes two floats past x */
}
That is undefined behavior. It doesn't generate a warning, and on some hosts it will work and on other hosts it will not work (or even crash), depending on how the compiler chose to lay things out. It isn't a "bug" in the C language, but it's certainly a footgun.
My understanding of Perl's filesystem rules is that the programmer is responsible for encoding every string before passing it to the OS, using their knowledge of the OS. If you skip that encoding step (as most of us do, because there's no easy, universal facility for us to know which encoding the host is using), then you run the risk of undefined behavior, which may manifest differently depending on the internal details of the string.
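A minimal sketch of that encoding step, assuming a UTF-8 filesystem (discovering the right encoding name per host is exactly the missing piece):

use Encode qw( encode );

my $name  = "\x{E9}.txt";            # "é.txt" as a codepoint string
my $bytes = encode('UTF-8', $name);  # encode before handing it to the OS
open(my $fh, '>', $bytes) or die("open: $!");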
I think it would be very nice to have a facility to know how to encode filesystem strings, and a library that performs that job, so that doing it right is easier than not. I wouldn't consider that library to be working around a bug, but rather supplying a missing feature of the language.
I think it would be very nice to have a facility to know how to encode filesystem strings, and a library that performs that job so that doing it right is easier than not
The problem is that at least some filesystems just don't care about encoding:
- NTFS is easy: it uses UTF-16 to encode names, and forbids only a few codepoints depending on the operating system (POSIX forbids / and NUL, as usual; Windows forbids /\:*"?<>| and NUL).
- The Apple File System (like HFS Plus before it, found on OS X Macintosh systems) encodes names in UTF-8.
- FAT uses an 8-bit encoding that depends on the operating system: CP437 on old DOS machines, CP850 on newer DOS machines in the "western" parts of the world, or some completely different encoding in other parts of the world and on non-DOS machines (e.g. the Atari). FAT does NOT store any information about which encoding is used. Excluded characters depend on the OS; additionally, the byte 0xE5 is used to mark deleted files. On DOS machines, names are converted to upper case. Yes, FAT smells funny, but because it is used for the EFI system partition, it won't go away any time soon.
- FAT extended with long filename support à la Microsoft (often called "VFAT") uses UCS-2 for the "long" filenames.
- exFAT (which does not look like classic FAT at all) uses UTF-16 to encode filenames. Microsoft forbids U+0000 to U+001F, and /\:*?"<>| in filenames.
- ext2, ext3, and ext4, like FAT, use an 8-bit encoding depending on the operating system, and forbid only NUL and / in filenames. Many Linux distributions assume that filenames are encoded in UTF-8, but that's completely in user space, not in the filesystem.
- The same is true for btrfs.
- And for xfs.
- And for zfs.
- And for ufs.
- And for ReiserFS.
There are many more filesystems, but I think these are still commonly in use on personal computers and PC-based servers.
As you can see, only a few filesystems use some variant of Unicode for the filenames. The others use just bytes, with some "forbidden" values (e.g. FAT), in some encoding that cannot be derived from the filesystem. Userspace may decide to use UTF-8 or some legacy encoding on those filesystems.
As long as we stay in user space (as opposed to kernel space), we don't have to care much about the encoding. Converting the encoding is the job of the operating system. Legacy filesystems have their encoding set via mount options or the like.
Systems like Linux and the *BSDs just don't care about encoding; they treat filenames as null-terminated sequences of bytes, just like Unix has since 1970. Modern Windows can treat filenames as Unicode when using the "wide API" (function names ending in a capital W), where all strings are UCS-2 or UTF-16. When using the "ANSI API" (function names ending in a capital A), strings use some legacy 8-bit encoding based on ASCII; the actual encoding depends on user and regional settings. Plan 9 from Bell Labs used UTF-8 for the entire API. (I don't know how Apple handles filenames in their APIs.)
So, how should your proposed library handle file names?
On Windows, it should use the "wide API", period. Everything is Unicode (UTF-16/UCS-2). Same for Plan 9, but using UTF-8 encoding.
Linux and the BSDs? Everything is a null-terminated collection of bytes, maybe UTF-8, maybe some legacy encoding, and maybe also depending on where in the filesystem you are. How should your library handle that?
Oh, and let's not forget that Android is technically Linux, with a lot of custom user-space on top.
Mac OS X? I don't know the APIs, but it inherited a lot from Unix, so it probably looks like Unix.
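On the Unix-like systems, about the best such a library could do is ask the locale, e.g. via the core I18N::Langinfo module. This is only a heuristic sketch: it reports what the current user expects, not what is actually on disk.

use I18N::Langinfo qw( langinfo CODESET );
use POSIX qw( setlocale LC_CTYPE );

setlocale(LC_CTYPE, '');           # adopt the user's locale settings
my $codeset = langinfo(CODESET);   # e.g. "UTF-8" or "ISO-8859-1"
print "Filenames are probably encoded as $codeset\n";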
I completely omitted other command line parameters, the environment, and I/O via STDIN/STDOUT/STDERR. Command line and environment are more or less just an API thing, like filenames. I/O is completely different: it lacks any information about the encoding and is treated as a byte stream on Unix, whereas Windows treats it as a character stream in some unspecified 8-bit encoding, converting line endings as needed, but not characters.
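In Perl, binmode at least lets a program pin that choice down per stream instead of inheriting the platform default; a minimal sketch:

binmode STDOUT, ':raw';                 # byte stream, no CRLF translation
# binmode STDOUT, ':encoding(UTF-8)';   # or: character stream, explicit encoding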
See also Re^7: any use of 'use locale'? (source encoding).
Alexander
--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
My understanding of Perl's filesystem rules is that the programmer is responsible for encoding every string before passing it to the OS
Yes, but open can transform that properly-encoded text into garbage. That's a bug.
If you skip that encoding step
I didn't say anything about skipping the encoding step. This has nothing to do with anything I said.
I'll continue anyway, but it's all a straw man.
It isn't a "bug" in the C language
Of course not, because the C language doesn't define the behaviour of this (which is well understood to allow it to have any behaviour at all).
However, Perl does define the behaviour of open. It should create a file with the provided name. Provide a properly-encoded string consisting of bytes C3 A9 2E 74 78 74, and it should create a file with the name consisting of the bytes C3 A9 2E 74 78 74. It doesn't always do that, and that's a bug.
Your example is a false parallel.
$d eq $u is true in the snippet I provided earlier, so open should do the same for both. But it doesn't. That's a bug.
Yes. But you created $d and $u using "internal" Perl functions (AFAICT the programmer is not supposed to invoke upgrade() / downgrade() directly?). Now, can this bug be reproduced using "supported" operations?
You can reproduce it by concatenating the result of readlink() or glob to a codepoint-string, but that in itself means breaking the conventions (because you're not supposed to mix codepoint-strings with byte-strings).
So now I'm starting to lean toward the position that one should decode ARGV, decode STDIN, and decode readlink() and glob output, and thus always work with codepoint-strings (which can safely be concatenated, trimmed, etc., and then passed to open(), because open() "calls" transform(), which detects that it needs to encode them).
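A sketch of that convention, assuming UTF-8 throughout (what to do with input that fails to decode is the part left open):

use Encode qw( decode );

@ARGV = map { decode('UTF-8', $_) } @ARGV;   # codepoint-strings from here on
binmode STDIN,  ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';

my $target = readlink($ARGV[0]);
$target = decode('UTF-8', $target) if defined $target;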
So now I'm starting to lean to the position that one should decode ARGV
Yes, but that means you might not be able to access/create some files.
Files in unix systems are arbitrary sequences of bytes, so the file name might not be decodable. Imagine two processes/users/machines using different locales accessing the same volume. That said, the de facto standardization towards UTF-8 makes problems an exception rather than the rule.
Files in Windows are encoded using UTF-16le, but Perl uses a different encoding to talk to the OS. For example, Perl uses Windows-1252 on my machine, so using @ARGV will invariably limit me to files that can be encoded using Windows-1252.
In Windows, you can use CommandLineToArgvW instead of @ARGV, plus Win32::LongPath to work with any path, and you can decode them without issue.
In unix, you can work with any path by making sure to provide downgraded strings, but there's no simple solution to decoding them without loss.
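For illustration, a lossy best-effort decoder; decode_path is a hypothetical helper, not an existing API. The ambiguous fallback is exactly why there's no simple lossless solution.

use Encode qw( decode );

# Hypothetical helper: decode a byte path as UTF-8 when possible,
# otherwise keep the raw (downgraded) bytes. The caller can't tell
# which case occurred without checking separately; that's the lossiness.
sub decode_path {
    my ($bytes) = @_;
    my $copy = $bytes;   # decode() with a CHECK value may modify its argument
    my $str  = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return defined $str ? $str : $bytes;
}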
One way of accessing CommandLineToArgvW (from here):
use strict;
use warnings;

use feature qw( say state );

use open ':std', ':encoding('.do { require Win32; "cp".Win32::GetConsoleOutputCP() }.')';

use Config     qw( %Config );
use Encode     qw( decode encode );
use Win32::API qw( ReadMemory );

use constant PTR_SIZE => $Config{ptrsize};

use constant PTR_PACK_FORMAT =>
      PTR_SIZE == 8 ? 'Q'
    : PTR_SIZE == 4 ? 'L'
    : die("Unrecognized ptrsize\n");

use constant PTR_WIN32API_TYPE =>
      PTR_SIZE == 8 ? 'Q'
    : PTR_SIZE == 4 ? 'N'
    : die("Unrecognized ptrsize\n");

sub lstrlenW {
    my ($ptr) = @_;

    state $lstrlenW = Win32::API->new('kernel32', 'lstrlenW', PTR_WIN32API_TYPE, 'i')
        or die($^E);

    return $lstrlenW->Call($ptr);
}

sub decode_LPCWSTR {
    my ($ptr) = @_;

    return undef if !$ptr;

    my $num_chars = lstrlenW($ptr)
        or return '';

    return decode('UTF-16le', ReadMemory($ptr, $num_chars * 2));
}

# Returns true on success. Returns false and sets $^E on error.
sub LocalFree {
    my ($ptr) = @_;

    state $LocalFree = Win32::API->new('kernel32', 'LocalFree', PTR_WIN32API_TYPE, PTR_WIN32API_TYPE)
        or die($^E);

    return $LocalFree->Call($ptr) == 0;
}

sub GetCommandLine {
    state $GetCommandLine = Win32::API->new('kernel32', 'GetCommandLineW', '', PTR_WIN32API_TYPE)
        or die($^E);

    return decode_LPCWSTR($GetCommandLine->Call());
}

# Returns a reference to an array on success. Returns undef and sets $^E on error.
sub CommandLineToArgv {
    my ($cmd_line) = @_;

    state $CommandLineToArgv = Win32::API->new('shell32', 'CommandLineToArgvW', 'PP', PTR_WIN32API_TYPE)
        or die($^E);

    my $cmd_line_encoded = encode('UTF-16le', $cmd_line."\0");
    my $num_args_buf     = pack('i', 0);  # Allocate space for an "int".

    my $arg_ptrs_ptr = $CommandLineToArgv->Call($cmd_line_encoded, $num_args_buf)
        or return undef;

    my $num_args = unpack('i', $num_args_buf);

    my @args =
        map { decode_LPCWSTR($_) }
            unpack PTR_PACK_FORMAT.'*',
                ReadMemory($arg_ptrs_ptr, PTR_SIZE * $num_args);

    LocalFree($arg_ptrs_ptr);

    return \@args;
}

{
    my $cmd_line = GetCommandLine();
    say $cmd_line;

    my $args = CommandLineToArgv($cmd_line)
        or die("CommandLineToArgv: $^E\n");

    for my $arg (@$args) {
        say "<$arg>";
    }
}
But you created $d and $u using "internal" Perl functions
We could debate that, but it's completely irrelevant. I used them because it made the example clear. But I could have used ordinary string literals to get the same behaviour.
can this bug be reproduced using "supported" operations?
utf8::upgrade and utf8::downgrade are fully supported. But yes.
because you're not supposed to mix codepoint-strings with byte-strings
One, there was no "mixing", so that's also completely irrelevant.
Two, that's completely untrue. "Mixing" strings with different internal storage is not only acceptable, it's common.
use utf8;
my $u = "Éric";
my $d = "Brine";
my $s = "$u $d"; # Perfectly ok!