in reply to Re^5: how are ARGV and filename strings represented?
in thread how are ARGV and filename strings represented?

I think it would be very nice to have a facility that tells you how to encode filesystem strings, and a library that performs that job, so that doing it right is easier than doing it wrong.

The problem is that at least some filesystems just don't care about encoding:

There are many more filesystems, but I think these are still commonly in use on personal computers and PC-based servers.

As you can see, only a few filesystems use some variant of Unicode encoding for filenames. The others store just bytes, with some "forbidden" values (eFAT), in an encoding that cannot be derived from the filesystem itself. Userspace may decide to use UTF-8 or some legacy encoding on those filesystems.

As long as we stay in user space (as opposed to kernel space), we don't have to care much about the encoding. Converting the encoding is the job of the operating system. Legacy filesystems have their encoding set via mount options or the like.

Systems like Linux and the *BSDs just don't care about encoding: they treat filenames as null-terminated strings of bytes, just as Unix has since 1970. Modern Windows can treat filenames as Unicode when using the "wide API" (function names ending in a capital W), where all strings are UTF-16 (originally UCS-2). When using the "ANSI API" (function names ending in a capital A), strings use some legacy 8-bit encoding based on ASCII; the actual code page depends on system and regional settings. Plan 9 from Bell Labs used UTF-8 for the entire API. (I don't know how Apple handles filenames in their APIs.)

So, how should your proposed library handle file names?

On Windows, it should use the "wide API", period. Everything is Unicode (UTF-16/UCS-2). Same for Plan 9, but using UTF-8 encoding.

Linux and the BSDs? Everything is a null-terminated collection of bytes, maybe UTF-8, maybe some legacy encoding, and maybe also depending on where in the filesystem you are. How should your library handle that?
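The "just bytes" behaviour is easy to demonstrate from any scripting language; here is a quick Python sketch (Python only because its bytes/str split makes the point visible — any language that passes raw bytes through behaves the same on Linux):

```python
import os
import tempfile

# On POSIX systems the kernel never validates filenames as UTF-8.
# The single byte 0xE9 is not valid UTF-8 on its own.
d = tempfile.mkdtemp().encode()
name = b'caf\xe9.txt'
with open(os.path.join(d, name), 'wb') as f:
    f.write(b'hello')

# Passing a bytes path makes os.listdir return the raw bytes, unmangled.
print(os.listdir(d))  # [b'caf\xe9.txt']
```

(On APFS, the very same open() is refused, which is exactly the inconsistency discussed further down in this thread.)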

Oh, and let's not forget that Android is technically Linux, with a lot of custom user-space on top.

Mac OS X? I don't know the APIs, but it inherited a lot from Unix, so it probably looks like Unix.


I completely omitted other command line parameters, the environment, and I/O via STDIN/STDOUT/STDERR. Command line and environment are more or less just an API thing, like filenames. I/O is completely different: it carries no information about its encoding and is treated as a byte stream on Unix. Windows treats it as a character stream in an unspecified 8-bit encoding, converting line endings as needed, but not characters.

See also Re^7: any use of 'use locale'? (source encoding).

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

Replies are listed 'Best First'.
Re^7: how are ARGV and filename strings represented?
by choroba (Cardinal) on May 05, 2024 at 15:11 UTC
    > The Apple File System, like HFS Plus, found on OS X Macintosh systems is encoded in UTF-8.

    It's worse than that. Consider the following program:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use feature qw{ say };
    use Unicode::Normalize qw{ NFKD NFKC };

    say $^O;

    my $letter = "\N{LATIN SMALL LETTER A WITH ACUTE}";

    open my $out, '>', NFKD($letter) or die "Open: $!";

    unlink NFKC($letter) or warn "Warn unlink: $!";
    unlink NFKD($letter) or die "Unlink: $!";

    Running it on Linux and Mac gives the following different outputs:

    linux
    Warn unlink: No such file or directory at ./script.pl line 13.

    versus

    darwin
    Unlink: No such file or directory at ./script.pl line 14.

    The same happens when you exchange NFKC and NFKD. Yes, on a Mac, normalization happens on top of UTF-8.
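    The underlying effect is easy to see outside Perl as well. A quick Python sketch showing that the NFC and NFD forms of "á" are different strings (and hence, on a byte-preserving filesystem, different filenames), even though they render identically:

```python
import unicodedata

# Compose 'a' + combining acute into the single code point U+00E9's
# neighbour U+00E1, and decompose U+00E1 back into two code points.
nfc = unicodedata.normalize('NFC', 'a\u0301')   # -> '\u00e1'
nfd = unicodedata.normalize('NFD', '\u00e1')    # -> 'a' + '\u0301'

print(nfc == nfd)            # False: same appearance, different strings
print(nfc.encode('utf-8'))   # b'\xc3\xa1'  (2 bytes)
print(nfd.encode('utf-8'))   # b'a\xcc\x81' (3 bytes)
```

    So a filesystem that stores filenames normalized (or compares them normalized) will happily "find" a file under a byte sequence that was never written, and vice versa.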

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re^7: how are ARGV and filename strings represented?
by ikegami (Patriarch) on May 05, 2024 at 13:33 UTC

    So, how should your proposed library handle file names?

    I like the idea of mapping invalid bytes to other characters (e.g. surrogates like Python's surrogateescape, characters beyond 0x10FFFF, etc.)

    This provides a way of accepting and generating any file name, while considering file names to be decodable text.
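    A minimal sketch of that mechanism, using Python's own surrogateescape error handler: the invalid byte 0xE9 is smuggled into the string as the lone surrogate U+DCE9, and encoding with the same handler restores the original bytes exactly:

```python
# A filename that is not valid UTF-8: 0xE9 is a bare Latin-1 'é'.
raw = b'caf\xe9'

# surrogateescape maps the invalid byte 0xE9 to the unpaired
# surrogate U+DCE9 instead of raising UnicodeDecodeError.
text = raw.decode('utf-8', 'surrogateescape')
print(ascii(text))  # 'caf\udce9'

# Round trip: encoding with the same handler restores the bytes.
print(text.encode('utf-8', 'surrogateescape') == raw)  # True
```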

      I like the idea of mapping invalid bytes to other characters

      However that is implemented, it must be implemented system-wide, or you will end up in chaos. So it must become part of either the kernel or the libc.

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
        How ever that is implemented, it must be implemented system-wide, or you will end up in chaos.

        We already live in chaos. Python implemented it Python-wide, and it arguably resulted in less chaos than Perl has.

        Another option is to pair the string of bytes with a best guesstimate of its encoding inside some sort of path object, which can then be flattened back to the exact bytes it came from, and can also answer questions about what it would look like in Unicode and how confident we are about its encoding. I'm proposing wrapping paths in an object anyway, so maybe that's what I'd do. I need them to stringify back to bytes in order to interoperate with the rest of Perl, anyway. Python gets the advantage of the whole language ecosystem respecting the remapped invalid characters, so it can pass filenames around as plain strings.

        The Perl library filename-taking and filename-producing operators would need to support it, and any interface to external systems would need to be aware of it.

        But lack of support by interfaces wouldn't be that bad. You would simply end up with a file one can't create/access, which is already the case.
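        The wrapped-path idea could look roughly like this (a hypothetical sketch, in Python for brevity; the name GuessedPath is made up, and the "guess" here is a trivial UTF-8-then-Latin-1 fallback, not a serious encoding detector):

```python
class GuessedPath:
    """Hypothetical wrapper: keeps the original bytes plus a best-guess
    decoding and a confidence flag, and always flattens back to the
    exact bytes it came from."""

    def __init__(self, raw: bytes):
        self.raw = raw
        try:
            self.text = raw.decode('utf-8')
            self.confident = True              # valid UTF-8: good odds
        except UnicodeDecodeError:
            self.text = raw.decode('latin-1')  # never fails; pure guess
            self.confident = False

    def __bytes__(self):
        return self.raw                        # lossless round-trip

p = GuessedPath(b'caf\xe9')
print(p.text, p.confident)     # café False
print(bytes(p) == b'caf\xe9')  # True
```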

Re^7: how are ARGV and filename strings represented?
by afoken (Chancellor) on Oct 01, 2024 at 23:47 UTC

    Things get even more interesting if you try to use filenames containing invalid UTF-8 sequences on various filesystems:

    Invalid-UTF8 vs the filesystem by Kristian Köhntopp. In summary:

    • XFS on Linux and ext4 on Linux don't care at all. Filenames are just bytes.
    • ZFS on Linux refuses filenames containing invalid UTF-8 sequences.
    • APFS on MacOS Ventura also refuses filenames containing invalid UTF-8 sequences.

    Python does not like tar archives with invalid UTF-8 sequences.

    And a little ugly detail: apparently there is a function sys.getfilesystemencoding() that takes no parameters. Python seems to assume that all filesystems share the same encoding and that it is not path dependent.
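    For what it's worth, that single process-wide answer is what os.fsencode()/os.fsdecode() are built on — one encoding for every path, regardless of which mounted filesystem the path actually lives on:

```python
import os
import sys

# One answer for the whole process -- no path argument anywhere.
print(sys.getfilesystemencoding())  # typically 'utf-8' on modern Linux

# os.fsencode/os.fsdecode use that encoding plus surrogateescape,
# so round-tripping arbitrary bytes still works...
raw = b'caf\xe9'
print(os.fsencode(os.fsdecode(raw)) == raw)  # True
# ...but the *interpretation* of those bytes is identical for every
# mounted filesystem, which is exactly the assumption being criticized.
```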

    This is at least conceptually similar to my pet problem with File::Spec assuming uniform behaviour across all mounted filesystems.

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

      Just playing devil's advocate: on Linux it looks like it's possible to mount external filesystems (vfat, ntfs, iso9660?) with other encodings. What happens if the path up to the mount point is in UTF-8 and the path below the mount point is in another encoding?

      It's true that no sane person would try this, but it is possible...

        It sure is possible, and it does not sound that insane if you are trying to recover data or the like. Just imagine working with legacy media (retro computing) or with media from other operating systems (e.g. MacOS, Windows, OS/2, ...). I did not expect any filesystem to check filenames for correct UTF-8 encoding, especially not ZFS, which comes from an old Unix and, in my mind, should store filenames as just bytes.

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)