in reply to Re^4: how are ARGV and filename strings represented?
in thread how are ARGV and filename strings represented?

If I'm writing C, and do dumb things with pointers like:
void function2(float *point);

void function1() {
    float x, y, z;
    function2(&x);   /* passes a pointer to a single float */
}

void function2(float *point) {
    point[2] = 5;    /* writes past the end of x -- undefined behavior */
}
That is undefined behavior. It doesn't generate a warning, and on some hosts it will work while on others it will fail (or even crash), depending on how the compiler chose the internal representation. It isn't a "bug" in the C language, but it's certainly a footgun.

My understanding of Perl's filesystem rules is that the programmer is responsible for performing encoding on every string prior to passing it to the OS, using their knowledge of the OS. If you skip that encoding step (as most of us do, because there's no easy universal facility for us to know which encoding the host is using), then you run the risk of undefined behavior, which may manifest itself differently depending on internal details of the string.
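For example, on a system whose filesystem is known to expect UTF-8, "doing it right" looks roughly like this (the encoding name is an assumption; it could just as well be a legacy 8-bit codepage):

use Encode qw(encode);

my $name  = "caf\x{e9}.txt";           # a decoded (character) string
my $bytes = encode('UTF-8', $name);    # encode for the OS; UTF-8 is assumed here
open my $fh, '>', $bytes or die "open: $!";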

I think it would be very nice to have a facility to know how to encode filesystem strings, and a library that performs that job so that doing it right is easier than not. I wouldn't consider that library to be working around a bug, but rather supplying a missing feature of the language.
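A rough sketch of what such a facility might look like on a POSIX-ish system, with the loud caveat that the locale codeset is only a guess at the filesystem encoding and nothing guarantees the two agree:

use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);
use Encode qw(find_encoding);

setlocale(LC_CTYPE, '');                            # adopt the user's locale
my $codeset = langinfo(CODESET);                    # e.g. "UTF-8" or "ISO-8859-1"
my $enc     = find_encoding($codeset) || find_encoding('UTF-8');

# Hypothetical helper: turn a character string into bytes for open(), unlink(), ...
sub os_filename {
    my ($name) = @_;
    return $enc->encode($name);
}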

Replies are listed 'Best First'.
Re^6: how are ARGV and filename strings represented?
by afoken (Chancellor) on May 05, 2024 at 13:17 UTC
    I think it would be very nice to have a facility to know how to encode filesystem strings, and a library that performs that job so that doing it right is easier than not

    The problem is that at least some filesystems just don't care about encoding:

    • NTFS is easy, it uses UTF-16 to encode names and forbids only a few codepoints, depending on the operating system (POSIX forbids / and NUL, as usual; Windows forbids /\:*"?<>| and NUL).
    • The Apple File System, like HFS Plus, found on OS X Macintosh systems, is encoded in UTF-8.
    • FAT uses an 8-bit encoding that depends on the operating system: CP437 on old DOS machines, CP850 on newer DOS machines in the "western" parts of the world, or some completely different encoding in other parts of the world and on non-DOS machines (e.g. Atari). FAT does NOT store any information about which encoding is used. Excluded characters depend on the OS; additionally, byte 0xE5 is used to mark deleted files. On DOS machines, names are converted to upper case. Yes, FAT smells funny, but because it is used for the EFI system partition, it won't go away any time soon.
    • FAT extended with Long filename support à la Microsoft (often called "VFAT") uses UCS-2 for the "long" filenames.
    • exFAT (which does not look like classic FAT at all) uses UTF-16 to encode filenames. Microsoft forbids U+0000 to U+001F, and /\:*?"<>| in filenames.
    • ext2, ext3, and ext4, like FAT, use an 8-bit encoding depending on the operating system, and forbid only NUL and / in filenames. Many Linux distributions assume that filenames are encoded in UTF-8, but that's completely in user space, not in the filesystem.
    • The same is true for btrfs.
    • And for xfs.
    • And for zfs.
    • And for ufs.
    • And for ReiserFS.

    There are many more filesystems, but I think these are still commonly in use on personal computers and PC-based servers.

    As you can see, only a few filesystems use some variant of Unicode encoding for the filenames. The others use just bytes, with some "forbidden" values (e.g. 0xE5 on FAT), in some encoding that cannot be derived from the filesystem. Userspace may decide to use UTF-8 or some legacy encoding on those filesystems.

    As long as we stay in user space (as opposed to kernel space), we don't have to care much about the encoding. Converting the encoding is the job of the operating system. Legacy filesystems have their encoding set via mount options or the like.

    Systems like Linux and the *BSDs just don't care about encoding; they treat filenames as null-terminated collections of bytes, just as Unix has done since 1970. Modern Windows can treat filenames as Unicode when using the "wide API" (function names ending in a capital W), where all strings are UCS-2 or UTF-16. When using the "ANSI API" (function names ending in a capital A), strings use some legacy 8-bit encoding based on ASCII; the actual encoding depends on user and regional settings. Plan 9 from Bell Labs used UTF-8 for the entire API. (I don't know how Apple handles filenames in their APIs.)

    So, how should your proposed library handle file names?

    On Windows, it should use the "wide API", period. Everything is Unicode (UTF-16/UCS-2). Same for Plan 9, but using UTF-8 encoding.

    Linux and the BSDs? Everything is a null-terminated collection of bytes, maybe UTF-8, maybe some legacy encoding, and maybe also depending on where in the filesystem you are. How should your library handle that?

    Oh, and let's not forget that Android is technically Linux, with a lot of custom user-space on top.

    Mac OS X? I don't know the APIs, but it inherited a lot from Unix. So it probably looks like Unix.


    I completely omitted other command line parameters, the environment, and I/O via STDIN/STDOUT/STDERR. Command line and environment are more or less just an API thing, like filenames. I/O is completely different. It lacks any information about the encoding and is treated as a byte stream on Unix. Windows treats it as a character stream with an unspecified 8-bit encoding, converting line endings as needed, but not characters.
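    If you want something more predictable than the platform default, the encoding layers can be set explicitly; a minimal sketch, assuming the other end of each stream really does expect UTF-8:

    # Explicit I/O encoding instead of the platform default.
    # UTF-8 is an assumption; pick whatever the other end expects.
    binmode STDIN,  ':encoding(UTF-8)';
    binmode STDOUT, ':encoding(UTF-8)';
    binmode STDERR, ':encoding(UTF-8)';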

    See also Re^7: any use of 'use locale'? (source encoding).

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
      > The Apple File System, like HFS Plus, found on OS X Macintosh systems is encoded in UTF-8.

      It's worse than that. Consider the following program:

      #!/usr/bin/perl
      use warnings;
      use strict;
      use feature qw{ say };
      use Unicode::Normalize qw{ NFKD NFKC };

      say $^O;

      my $letter = "\N{LATIN SMALL LETTER A WITH ACUTE}";

      open my $out, '>', NFKD($letter) or die "Open: $!";

      unlink NFKC($letter) or warn "Warn unlink: $!";
      unlink NFKD($letter) or die "Unlink: $!";

      Running it on Linux and Mac gives the following different outputs:

      linux
      Warn unlink: No such file or directory at ./script.pl line 13.
      versus
      darwin
      Unlink: No such file or directory at ./script.pl line 14.

      The same happens when you exchange NFKC and NFKD. Yes, on a Mac, normalization happens on top of UTF-8.

      map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

      So, how should your proposed library handle file names?

      I like the idea of mapping invalid bytes to other characters (e.g. surrogates like Python's surrogateescape, characters beyond 0x10FFFF, etc.)

      This provides a way of accepting and generating any file name, while considering file names to be decodable text.
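      A rough sketch of such a surrogateescape-style round trip, borrowing Python's scheme of mapping each undecodable byte 0xXY to the lone surrogate U+DCXY (this is one possible mapping, not an established Perl API):

      use strict;
      use warnings;
      no warnings 'surrogate';               # we deliberately create lone surrogates
      use Encode qw(decode encode FB_QUIET);

      # bytes -> text, escaping invalid bytes as lone surrogates
      sub decode_escape {
          my ($bytes) = @_;
          my $text = '';
          while (length $bytes) {
              $text .= decode('UTF-8', $bytes, FB_QUIET);   # consumes the valid prefix
              $text .= chr(0xDC00 | ord(substr($bytes, 0, 1, ''))) if length $bytes;
          }
          return $text;
      }

      # text -> bytes, undoing the escapes
      sub encode_escape {
          my ($text) = @_;
          my $bytes = join '', map {
              my $cp = ord;
              $cp >= 0xDC80 && $cp <= 0xDCFF ? chr($cp & 0xFF) : encode('UTF-8', $_);
          } split //, $text;
          utf8::downgrade($bytes);           # hand raw bytes to the OS, not upgraded ones
          return $bytes;
      }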

        I like the idea of mapping invalid bytes to other characters

        However that is implemented, it must be implemented system-wide, or you will end up in chaos. So, it must become either part of the kernel, or of the libc.

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

      Things get even more interesting if you try to use filenames containing invalid UTF-8 sequences on various filesystems:

      Invalid-UTF8 vs the filesystem by Kristian Köhntopp. In summary:

      • XFS on Linux and ext4 on Linux don't care at all. Filenames are just bytes.
      • ZFS on Linux refuses filenames containing invalid UTF-8 sequences.
      • APFS on MacOS Ventura also refuses filenames containing invalid UTF-8 sequences.

      Python does not like tar archives with invalid UTF-8 sequences.

      And one ugly little detail: apparently there is a function sys.getfilesystemencoding() that takes no parameters. Python seems to assume that all filesystems have the same encoding and that it is not path-dependent.

      This is at least conceptually similar to my pet problem of File::Spec assuming uniform behaviour across various mounted filesystems.

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

        Just playing devil's advocate: on Linux it looks like it's possible to mount external filesystems (vfat, ntfs, iso9660?) with other encodings. What happens if the path up to the mount point is in UTF-8 and the path after the mount point is in another encoding?

        It's true that no sane person would try this, but it is possible...

Re^6: how are ARGV and filename strings represented?
by ikegami (Patriarch) on May 03, 2024 at 18:06 UTC

    My understanding of Perl's filesystem rules is that the programmer is responsible for performing encoding on every string prior to passing it to the OS

    Yes, but open can transform that properly-encoded text into garbage. That's a bug.

    If you skip that encoding step

    I didn't say anything about skipping the encoding step. This has nothing to do with anything I said.

    I'll continue anyway, but it's all a straw man.

    It isn't a "bug" in the C language

    Of course not, because the C language doesn't define the behaviour of this (which is well understood as allowing it to have any behaviour).

    However, Perl does define the behaviour of open. It should create a file with the provided name. Provide a properly-encoded string consisting of bytes C3 A9 2E 74 78 74, and it should create a file with the name consisting of the bytes C3 A9 2E 74 78 74. It doesn't always do that, and that's a bug.
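    For what it's worth, the situation can be reproduced with nothing more than two strings that compare equal but differ in their internal representation; a minimal sketch, assuming a Unix-like system that takes filenames as raw bytes (the filename is just an example):

    use strict;
    use warnings;

    my $bytes = "\xC3\xA9.txt";        # "é.txt", already UTF-8 encoded, held as bytes
    my $upgraded = $bytes;
    utf8::upgrade($upgraded);          # same characters, internal UTF-8 flag now set

    print $bytes eq $upgraded ? "equal\n" : "not equal\n";   # prints "equal"

    open my $fh1, '>', $bytes    or die "open: $!";
    open my $fh2, '>', $upgraded or die "open: $!";
    # Whether both opens end up naming the same file depends on what perl hands
    # to the OS for the upgraded string -- exactly the behaviour discussed above.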

    Your example is a false parallel.