I have too many projects as it is, but I keep coming back to the idea that Perl ought to have a universal "virtual filesystem" module to abstract away some details of the platforms it runs on. There are a lot of ways this could go, but I have two main itches to scratch:
- Seamless support for Unicode file names in a Path::Class-like API.
- The ability to work with filesystems that may be backed by real files or by emulated filesystems, e.g. browsing zip files, ftp, webdav, iso9660, and so on, and the ability to merge them together like mounts on Linux, but without needing elevated privileges on the host system.
As it happens, there is a great CPAN namespace "VFS" that a similar-minded person uploaded in 2004 and then never finished an implementation of. I've reached out to him and it seems he might be open to the idea of handing it off to me to finish. Negotiations are ongoing.
But, before I touch such a great namespace, I'd like to collect ideas from more minds than just my own! Here are some important points that I am considering:
Unicode Filenames
On UNIX, filenames are just bytes. Unix people added Unicode support through the use of "Locale" features, so that Unicode-aware programs could try decoding the filenames according to the locale, but Perl does not respect the locale and always returns bytes from readdir / glob / readlink / getcwd. Also, in Perl, if you take a filename that is bytes which happen to be valid UTF-8, and then append Unicode to that string, the resulting string will not be usable as a filename. (It will flatten to bytes with a warning, but double-encode the high bytes you read from readdir, so the directory won't exist.)
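To make the double-encoding problem concrete, here is a minimal sketch. It assumes the directory name on disk is "café" stored as UTF-8 bytes, and it simulates the readdir result with a literal byte string rather than touching the real filesystem.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as readdir() would hand them back for a directory named "café"
# on a filesystem that stores names as UTF-8 (assumed for this sketch).
my $dir_bytes = "caf\xC3\xA9";

# Naive approach: append a character string holding a code point above 0xFF.
# Perl now treats the bytes C3 and A9 as two separate Latin-1 characters.
my $naive       = $dir_bytes . "/\x{263A}.txt";
my $naive_bytes = encode('UTF-8', $naive);   # "é" becomes C3 83 C2 A9

# Careful approach: decode at the boundary, work in characters, encode on the way out.
my $dir_chars  = decode('UTF-8', $dir_bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
my $good_bytes = encode('UTF-8', $dir_chars . "/\x{263A}.txt");

printf "naive: %s\n", join ' ', map { sprintf '%02X', ord } split //, $naive_bytes;
printf "good:  %s\n", join ' ', map { sprintf '%02X', ord } split //, $good_bytes;
```

The naive path no longer starts with the original directory bytes, so the OS would look for a directory that does not exist.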
On Windows, Perl uses the ANSI ("A") API rather than the wide-character ("W") API, so the bytes you get from readdir depend on the Windows code page. This can work if the program is configured to run in the UTF-8 code page, but that is almost never the default, so most people get garbage when they read Unicode filenames under Windows, and have to do a lot of studying before they can make it work. If you do have the UTF-8 code page, it still leaves you with the mess that you would have on Unix.
There are other filesystems where path names belong to known character sets, and are not left to guessing with locales. For instance, with iso9660 you know from the metadata which character set is being used, and Locale doesn't enter into it. A module that walks an iso9660 filesystem should always be understood to return Unicode names, and not get tangled up with the program's Locale settings.
Proposal: While I might like a mode in Perl where readdir() returns Unicode, I suspect doing that on a global basis would break things too much, so I think a better solution is to have a Path::Class / Path::Tiny themed module where it is understood that all names given and returned will properly respect unicode. By using this module, authors can be assured that their code will work properly when presented with non-ascii directory and file names, and work cross-platform.
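As a rough illustration of what "all names given and returned are Unicode" could mean in practice, here is a small sketch. The function name unicode_children is hypothetical and is not part of any existing module, and the sketch assumes the host filesystem stores names as UTF-8 bytes (so it would not behave this way on a default Windows code page).

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use File::Temp qw(tempdir);

# Hypothetical helper: accepts a directory name as a character string and
# returns child names as character strings. Assumes UTF-8 names on disk.
sub unicode_children {
    my ($dir) = @_;
    opendir(my $dh, encode('UTF-8', $dir)) or die "opendir failed: $!";
    my @names = map  { decode('UTF-8', $_) }
                grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;
    return sort @names;
}

# Demonstration against a scratch directory.
my $tmp  = decode('UTF-8', tempdir(CLEANUP => 1));   # bytes to characters at the boundary
my $name = "\x{263A}-notes.txt";
open(my $fh, '>', encode('UTF-8', "$tmp/$name")) or die "open failed: $!";
close $fh;

my @found = unicode_children($tmp);
print scalar(@found), " entry found\n";
```

A real module would need to pick the encoding per platform (or per filesystem object) instead of hard-coding UTF-8.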
Virtual Filesystems
There are lots of great reasons for wanting virtual filesystems in the host, like FUSE modules, but why should we have them inside Perl?
- Avoid messing with the Host:
Let's say you want to walk a tree of a Git filesystem. You could check out a git branch, but that uses extra disk space, and if the program crashes it might leave behind files which need to be cleaned up. You could FUSE-mount the git branch as a mounted filesystem, but if the program crashes you'd leave behind a mount point, which could cause even more trouble (such as preventing unmounting of the parent volume). You could use a Git API for it, but then you have to use an unfamiliar API and maybe it isn't as advanced as your favorite File::Find module. Having a "virtual filesystem" in perl would solve this, as long as your favorite File::Find module could be pointed at it. If "VFS::Path" happened to have your favorite API for traversing trees, that would solve the problem.
- Abstracting the Files Being Served:
If you write a server for e.g. WebDAV or SFTP, the first thing those modules need is a data store of files to serve. Those modules then probably also offer you back-end hooks to handle what happens when users upload a file or want a directory listing. If there were a standard VFS for this, we could seamlessly plug together the modules that serve files with the modules that provide views of filesystems without doing a bunch of messy integration. Also, if the VFS module could be trusted to not allow symlinks to escape a designated sub-tree, that would help with security when writing these sorts of modules.
- Minting Root Filesystems:
I often want to create root-level tarballs of things like device nodes or root-owned files. Currently, I need to run my perl scripts as root just to be able to create the tree to pass to tar. But, it should be possible to specify these details in memory and write out the tar file directly without ever touching the filesystem metadata. A VFS module in perl userspace would allow the code designed for writing the real filesystem to write to a tar file instead, and without root access.
- Virtualizing old code that expects root access:
If the VFS were also able to intercept core perl file operations, you could take old perl code that expects to perform operations on root-owned files, and have that code instead modify in-memory simulations of those file systems. This could be handy for unit tests, or just for adapting old code without a rewrite.
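For the "Minting Root Filesystems" case above, part of this is already possible with the core Archive::Tar module, which can build entries purely in memory. The sketch below is only an illustration, not the proposed VFS API; the file contents and device numbers are made up for the example, and it relies on add_data accepting a hash of entry properties.

```perl
use strict;
use warnings;
use Archive::Tar;
use Archive::Tar::Constant;   # exports entry-type constants such as CHARDEV

my $tar = Archive::Tar->new;

# A character device node, described entirely in memory (no mknod, no root).
$tar->add_data('dev/null', '', {
    type => CHARDEV, devmajor => 1, devminor => 3,
    mode => 0666, uid => 0, gid => 0, uname => 'root', gname => 'root',
}) or die $tar->error;

# A root-owned regular file with restrictive permissions (example content).
$tar->add_data('etc/example.conf', "setting=1\n", {
    mode => 0600, uid => 0, gid => 0, uname => 'root', gname => 'root',
}) or die $tar->error;

# With no arguments, write() returns the archive as a string.
my $tarball = $tar->write;
printf "built %d bytes of tar data\n", length $tarball;
```

A VFS layer would let ordinary tree-writing code target something like this instead of the real filesystem, rather than requiring tar-specific calls.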
Proposal: To deal with all of these, I think the virtual filesystem should have independent filesystem objects, so they aren't all interconnected by default, and then an optional ability to use one of them to override the core perl IO operations. Each filesystem should have the ability to mount other filesystems at arbitrary paths, and each should have the ability to derive a "chroot" filesystem from an arbitrary path.
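A minimal sketch of the "independent filesystem objects with mounts" idea might look like the following. Everything here is hypothetical (the Sketch::FS package, its method names, and the choice to give each object a private current directory); it only shows path resolution, assumes mount points are absolute, normalized, and not '/', and ignores symlinks entirely.

```perl
package Sketch::FS;
use strict;
use warnings;

sub new {
    my ($class, %args) = @_;
    return bless { name => $args{name}, cwd => '/', mounts => {} }, $class;
}

# Attach another filesystem object at an absolute, normalized path.
sub mount {
    my ($self, $path, $fs) = @_;
    $self->{mounts}{$path} = $fs;
    return $self;
}

# Change this object's private current directory; the process cwd is untouched.
sub set_cwd {
    my ($self, $path) = @_;
    $self->{cwd} = $self->absolute($path);
    return $self;
}

# Lexically resolve $path against the private cwd. '..' never climbs above '/'.
sub absolute {
    my ($self, $path) = @_;
    my @parts = $path =~ m{^/} ? () : grep { length } split m{/}, $self->{cwd};
    for my $seg (split m{/}, $path) {
        next if $seg eq '' || $seg eq '.';
        if ($seg eq '..') { pop @parts } else { push @parts, $seg }
    }
    return '/' . join('/', @parts);
}

# Return the filesystem object that owns $path, plus the path inside that object.
sub resolve {
    my ($self, $path) = @_;
    my $abs = $self->absolute($path);
    for my $mp (sort { length($b) <=> length($a) } keys %{ $self->{mounts} }) {
        return $self->{mounts}{$mp}->resolve('/') if $abs eq $mp;
        return $self->{mounts}{$mp}->resolve(substr($abs, length($mp)))
            if index($abs, "$mp/") == 0;
    }
    return ($self, $abs);
}

package main;

my $root = Sketch::FS->new(name => 'root');
my $zip  = Sketch::FS->new(name => 'zip');
$root->mount('/mnt/zip', $zip);
$root->set_cwd('/mnt');

my ($fs, $inner) = $root->resolve('zip/docs/../readme.txt');
print "$fs->{name}: $inner\n";
```

Because every path is made absolute and clamped at '/' before dispatch, a "chroot" could be derived by wrapping an object and prefixing a base path, though a real implementation would also have to follow symlinks step by step instead of collapsing '..' lexically.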
Problems
- Windows has a concept of "volumes", and Unix does not. Should the VFS have a concept of volumes-per-filesystem? Or a concept of global root volumes which virtual filesystems can be mounted on? Or skip volumes entirely and let that be a "Windows user problem"? I'm leaning toward volumes-per-filesystem where most filesystems just have a default volume of '' (empty string) and then design the API in a way that avoids referencing volume name most of the time.
- In order to fully emulate the real filesystem using a virtual filesystem, I will need to track the "current directory" independent from the real filesystem. That way relative paths will resolve correctly. This also means I will need to fully resolve relative paths before using them. (So, it adds overhead cost, but then that helps with implementing chroots.) I think in the case where the filesystem is the real filesystem with no mounts and no chroots, I can optimize by using the real "chdir" and passing relative paths to the OS, avoiding the overhead. Thoughts?
- In order to override core perl file operations, I think I need XS. I can override CORE::GLOBAL::..., but if a module has e.g. a method named "open", then its author will write "CORE::open" any time they need the built-in rather than their own method, and that defeats the override I would make of CORE::GLOBAL::open. Even then, overriding PerlIO in XS won't help for XS modules that use other C libraries to open files. I'm not sure how successful this feature would be overall. Thoughts?
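The CORE::GLOBAL limitation in the last point can be seen with a small experiment. This is a sketch only: the override handles just the three-argument form of open, and the sub names plain_open and explicit_core are made up for the demonstration.

```perl
use strict;
use warnings;
use File::Spec;

our @intercepted;

BEGIN {
    no warnings 'once';
    # Must be installed before the code that calls open() is compiled.
    *CORE::GLOBAL::open = sub (*;$@) {
        push @intercepted, $_[2];
        return CORE::open($_[0], $_[1], $_[2]);   # sketch: 3-arg open only
    };
}

my $devnull = File::Spec->devnull;

# An unqualified open() is routed through the CORE::GLOBAL override.
sub plain_open {
    open(my $fh, '<', $devnull) or die "open failed: $!";
    close $fh;
}

# An explicit CORE::open() always reaches the real built-in.
sub explicit_core {
    CORE::open(my $fh, '<', $devnull) or die "open failed: $!";
    close $fh;
}

plain_open();
explicit_core();
print scalar(@intercepted), " call(s) intercepted\n";
```

Only the unqualified call should be recorded, which is why a pure-Perl override cannot catch modules that spell out CORE::open.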
Prior Work
I'm not the first one with this idea, of course. So far, I've found:
- Filesys::POSIX
This module implements a full POSIX virtual filesystem, though as the name implies, it does not
handle any Windows concepts like volumes or alternate path separators. It makes the odd choice
to throw exceptions for failed operations, including 'stat' which many users would use to test
for existence of files. Tests currently fail on BSD and Win32. Aside from these problems, it
is a very complete implementation.
Oddly, there don't seem to be any CPAN plugins built on it.
- Filesys::Virtual
This module intends to be a VFS, but lacks any specification of how the API should behave, and
was last updated in 2009. It also lacks an API for file ownership (chmod etc).
CPAN has implementations for SSH, DAAP, and a FUSE adapter to use it as the back-end for a real
mounted filesystem.
- VFSsimple
Very sparse API (insufficient for most uses), and last updated 2007.
CPAN has implementations for ISO, FTP, HTTP, and "rsync" (which just uses rsync to clone a
remote file system locally)
- File::Redirect
Same idea of redirecting global PerlIO into a module, but the implementation is limited to
stat / open / close, uses XS, doesn't work on perls newer than 5.20, and was last
updated in 2012.
It comes with support for mounting Zip files into the virtual filesystem.
What Am I Forgetting?
So, if you made it through all of that, what I'm looking for are ideas! What am I forgetting? What other features would you like to see? What do you feel are deficiencies in the current popular path modules like Path::Class or Path::Tiny? Should I just be building on some other CPAN module?
Also, I wrote a rough draft of the POD for such a module at https://github.com/nrdvana/perl-VFS/blob/main/lib/VFS.pm