I have long been bothered by this problem: I read a directory name which happens to be the UTF-8 representation of unicode text, append a unicode string to that name, then try writing to the new filename, only to get an error that the directory does not exist:
    $ perl -E 'mkdir("\x{100}")'
    $ perl -MB -E 'my @d= <*>; say B::perlstring($_) for @d'
    "\304\200"
    $ perl -E 'my ($d)= <*>; open(my $f, ">", "$d/\x{101}.txt") or die "$!"'
    No such file or directory at -e line 1.
Why? Because Perl passes the scalar to the C library's open(), delivering a UTF-8 encoding of the entire string, and the bytes that came from glob (which were never decoded from UTF-8) each get re-encoded as if they were individual characters.
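To make that concrete, here's a quick demonstration of the double encoding; utf8::encode stands in for what open() ends up handing to the OS once the string has been upgraded:

    $ perl -E 'my $d = "\xC4\x80";           # bytes from glob, never decoded
               my $path = "$d/\x{101}.txt";  # concatenation upgrades to unicode
               utf8::encode($path);          # the bytes the OS actually sees
               printf "%v02X\n", $path'
    C3.84.C2.80.2F.C4.81.2E.74.78.74

The directory on disk is "\xC4\x80", but the path handed to the OS starts with "\xC3\x84\xC2\x80", so the open fails with "No such file or directory".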
Perl expects the user to keep track of which strings are unicode and which strings are bytes, and never mix the two. In the example above, the real problem/bug is that glob returns bytes, and "$d/\x{101}.txt" is mixing bytes with unicode, producing garbage.
While that answer is technically correct, I'm not satisfied with it, because it results in a sub-optimal user experience. A user *ought* to be able to list a directory and get unicode back, append unicode to it, and write it back out. This process ought to be easy, instead of splattering the code with calls to encode() and decode(). Why can't we have nice things?
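For reference, here's roughly what that splattering looks like with today's tools (a sketch, assuming a UTF-8 filesystem):

    use Encode qw(decode encode);

    my ($d_bytes) = <*>;                      # raw bytes from glob, e.g. "\xC4\x80"
    my $d = decode('UTF-8', $d_bytes);        # now a real unicode string, "\x{100}"
    my $name = "$d/\x{101}.txt";              # safe: unicode appended to unicode
    open(my $f, ">", encode('UTF-8', $name))  # back to bytes for the syscall
        or die "$!";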
(The problem is even worse on Windows, where you must configure your program to run with the UTF-8 codepage or else you get even worse garbage, since Perl internally uses the ANSI variants of the Win32 API, which replace unrepresentable characters with placeholders.)
Python 2 had a system where unicode strings were represented differently from byte strings, and so the solution in Python 2 was "unicode in, unicode out". In other words, if you call a directory listing with a unicode directory path, all the results come back as unicode strings. So what happens if it reads a filename that isn't valid UTF-8 when you requested unicode return values? It just returns a byte string mixed in with the unicode ones.
    $ python2.7
    Python 2.7.18 (default, Oct 10 2021, 22:29:32)
    [GCC 11.1.0] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import os
    >>> l=os.listdir(".")
    >>> l
    ['\xc4\x80']
    >>> l=os.listdir(u".")
    >>> l
    [u'\u0100']

(now write a file alongside it whose name is one correct UTF-8 character followed by one non-UTF-8 byte)

    $ perl -MB -E 'open(my $f, ">", "\x{C4}\x{80}\x{A0}.txt") or die "$!"'

So, does this API behavior result in a sensible developer experience?

    $ python2.7
    Python 2.7.18 (default, Oct 10 2021, 22:29:32)
    [GCC 11.1.0] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import os
    >>> l=os.listdir('.')
    >>> l
    ['\xc4\x80\xa0.txt', '\xc4\x80']
    >>> l=os.listdir(u'.')
    >>> l
    ['\xc4\x80\xa0.txt', u'\u0100']
The answer to "what happens when you try combining a unicode directory name with a byte-string filename" is "it doesn't let you do that". So, that saves the developer from head-scratching I/O errors, and puts the exception closer to the source of the problem.

    >>> open(l[1]+'/'+l[0], 'w')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
Unfortunately, Perl can't adopt this solution because Perl doesn't have a logical separation between unicode strings and byte strings. (Yes, there is Perl's utf8 flag, but that's not a logical difference between the contents of scalars. References available upon request.)
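For instance, a quick demonstration that the flag changes the internal representation without changing a string's logical contents:

    $ perl -E 'my $x = "\xE9"; my $y = $x; utf8::upgrade($y);
               say $x eq $y ? "equal" : "different";
               say utf8::is_utf8($_) ? "flag on" : "flag off" for $x, $y'
    equal
    flag off
    flag on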
But, in Python 3.0, all strings are unicode! (similar in some ways to perl's stance) So what did they do for this situation?
    $ python3
    Python 3.11.3 (main, Jun 5 2023, 09:32:32) [GCC 13.1.1 20230429] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import os
    >>> l=os.listdir('.')
    >>> l
    ['Ā\udca0.txt', 'Ā']

So, er.... they return an invalid representation of the bytes? That is "\x{100}" followed by "\x{DCA0}" in place of the byte "\x{A0}". What is the Unicode 0xDC00 range? It's called the "Low Surrogate Area", and unicode.org says:
    Low Surrogate Area
    Range: DC00-DFFF

    Isolated surrogate code points have no interpretation; consequently,
    no character code charts or names lists are provided for this range.
    See http://www.unicode.org/charts/ for access to a complete list of
    the latest character code charts.
    ...
    For a complete understanding of high-surrogate code units,
    low-surrogate code units, and surrogate pairs used for the UTF-16
    encoding form, see the appropriate sections of the Unicode Standard

So basically, Python 3 encodes stray non-UTF-8 bytes as values in a reserved-for-other-uses set of codepoints which should never appear in a real unicode string. (Python calls this the "surrogateescape" error handler, specified in PEP 383.) Does it work correctly for round trips?
    >>> open(l[1]+'/'+l[0], "w")
    <_io.TextIOWrapper name='Ā/Ā\udca0.txt' mode='w' encoding='UTF-8'>
    >>> l=os.listdir('\u0100')
    >>> l
    ['Ā\udca0.txt']
    ^d
    $ perl -E '
        sub escapestr { $_[0] =~ s/([^\x20-\x7E])/sprintf("\\x%02X", ord $1)/egr }
        say escapestr($_) for <\x{100}/*>'
    \xC4\x80/\xC4\x80\xA0.txt
Sure enough, it round-trips those 0xDC00-0xDCFF codepoints back to the single non-unicode bytes they came from.
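For what it's worth, the same mapping is easy enough to express in Perl with the stock Encode module. This is a sketch, not production code, and decode_escape/encode_escape are names I just made up:

    use strict;
    use warnings;
    use Encode qw(decode encode FB_QUIET);

    # Decode UTF-8, but turn each undecodable byte into 0xDC00+byte,
    # the same trick Python 3 uses.
    sub decode_escape {
        my ($bytes) = @_;
        my $str = '';
        while (length $bytes) {
            $str .= decode('UTF-8', $bytes, FB_QUIET);  # consumes the valid prefix
            $str .= chr(0xDC00 | ord substr($bytes, 0, 1, '')) if length $bytes;
        }
        return $str;
    }

    # Reverse: map U+DC80..U+DCFF back to raw bytes; everything else
    # becomes ordinary UTF-8.
    sub encode_escape {
        my ($str) = @_;
        return join '', map {
            my $cp = ord;
            $cp >= 0xDC80 && $cp <= 0xDCFF ? chr($cp & 0xFF) : encode('UTF-8', $_)
        } split //, $str;
    }

With these, encode_escape(decode_escape("\xC4\x80\xA0")) hands back the original three bytes, matching the Python round trip above.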
The Python 3 trick of mapping stray bytes to 0xDC00+byte could be used in Perl to handle non-UTF-8 bytes in a new unicode-friendly API. But how does this work out alongside our other APIs?
Let's suppose we add a new feature "unicodefilenames". (hopefully we wouldn't have to type that much, and could eventually lump it in with "use v5.50")
    use feature 'unicodefilenames';
    my ($d)= <*>;
    open(my $f, ">", "$d/\x{101}.txt") or die "$!";

This works now. But what happens if we pass these file names to other modules in our program?
    package New;
    use v5.42;
    use feature 'unicodefilenames';
    Old->foo($_) for <*>;

    package Old;
    use v5.38;
    sub foo($fname) {
        open my $fh, "<", $fname;
    }
Whoops. The new unicode names get passed to a module that expects "a filename", and since all filenames were previously strings of bytes, the name will get encoded as plain old UTF-8, which doesn't respect the conversion from "\x{DCA0}" back to "\x{A0}". So anyone with a European locale and lots of upper Latin-1 in their filenames will end up with frequent breakage.
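To see the failure concretely, here's what Perl's (lax) internal utf8 encoding does to that surrogate:

    $ perl -E 'my $s = "\x{DCA0}"; utf8::encode($s); printf "%v02X\n", $s'
    ED.B2.A0

Instead of restoring the original byte "\xA0", the filename handed to the OS contains the three bytes "\xED\xB2\xA0", quietly naming a different file.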
What if Perl handled the 0xDC00 range specially regardless of the feature bit? This would break any old code that had been writing filenames using those characters. But nobody should ever be writing them, because they would only ever occur in a UTF-16 encoding. So the only legitimate reason anyone would want to write them would be if they took a UTF-16 encoded string, further encoded it as UTF-8, and wanted that to be a filename.
Assuming p5p decided that was an acceptable amount of back-compat breakage, what else could go wrong?
    package New;
    use v5.42;
    use feature 'unicodefilenames';
    Old->foo($_) for <*>;

    package Old;
    use v5.38;
    sub foo($fname) {
        my $dir= "tmp\x{85}";
        mkdir $dir or die "$!";
        system("cp -a $fname $dir/$fname") == 0 or die "$!";
    }
Whoops, there are two bugs here. First, the Old module doesn't know that it is being given a unicode filename, and, not anticipating any problem, it combines that string with a non-unicode string, producing garbage. Second, it shells out to a command, and the Perl interpreter has no way of knowing whether this is a "filename" situation where the 0xDC00 range should be re-interpreted. Keep in mind that people might have all sorts of reasons for passing invalid unicode (or UTF-16 codes) as arguments to external programs. (Well, maybe not, but it seems a lot more likely than passing them as filenames to filesystem APIs.)
But wait, what does Python do for passing bytes to external programs if all their strings are unicode?
    $ python3
    Python 3.11.3 (main, Jun 5 2023, 09:32:32) [GCC 13.1.1 20230429] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import subprocess

(Wrapped for readability)

    >>> subprocess.run([ 'perl','-E',
          'sub escapestr { $_[0] =~ s/([^\x20-\x7E])/sprintf("\\x%02X", ord $1)/egr }
           say escapestr($ARGV[0])',
          "\x80"])
    \xC2\x80
    >>> subprocess.run([ 'perl','-E',
          'sub escapestr { $_[0] =~ s/([^\x20-\x7E])/sprintf("\\x%02X", ord $1)/egr }
           say escapestr($ARGV[0])',
          "\udc80"])
    \x80
Woah! Pretty bold there, Python! If you want to pass the byte 0x80 as a parameter to an external program, you'd need to encode it as "\xDC80" in your always-unicode strings. (Or, use the Python3 "bytes" object instead of trying to carry around raw bytes inside unicode strings, which is what all the tutorials teach) Anyway, interesting and all, but I'm guessing this is a step too far for perl 5.
So back to filenames. What can we do? It looks like the only way we can prevent bugs from erupting everywhere is to keep using strings of plain bytes, with unicode converted to UTF-8 (or perhaps encoded according to locale, if anyone still uses non-UTF-8 locales). But what if we wrap filenames with objects?
    package New;
    use v5.36;
    use Path::UTiny;  # imagine a unicode-aware Path::Tiny

    # Create directory named "\xC4\x80"
    path("\x{100}")->mkdir;
    for (path(".")->children) {
        # compares as unicode
        Old->foo($_) if $_->name eq "\x{100}";
    }

    package Old;
    use v5.36;
    sub foo($dir) {
        # stringify to bytes, creates file "\xC4\x80/\x80.txt"
        open my $f, '>', "$dir/\x80.txt";
    }
This actually works! To be clear, I'm proposing that the path object would track unicode internally (where it could use Python3's trick of remapping the ambiguous bytes) and any time it was coerced to a string by unsuspecting legacy code, or by PerlIO API calls, it would yield the usual UTF-8 bytes.
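A minimal sketch of the stringification half of that proposal (Path::UTiny and its internals are imaginary; a real version would route stringification through a surrogate-escape mapping like the one sketched earlier, rather than plain strict UTF-8):

    package Path::UTiny;
    use strict;
    use warnings;
    use Encode ();
    use Exporter 'import';
    our @EXPORT = ('path');

    use overload
        # legacy code that interpolates the object gets UTF-8 bytes
        '""'     => sub { Encode::encode('UTF-8', $_[0]{name}) },
        fallback => 1;

    sub path { bless { name => $_[0] }, __PACKAGE__ }  # name held as unicode
    sub name { $_[0]{name} }  # unicode accessor, for eq comparisons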
The downside is that you still can't write

    $path= path("$path/$unicode");

because that would still be combining unicode with non-unicode. The ".=" operator could be overloaded to return new Path objects, but that might also surprise users when $x .= "/$y" has different results than $x= "$x/$y", so maybe not.
I don't see any practical way for Perl 5 to upgrade to unicode filenames in plain strings and native PerlIO functions. It would create about as many problems as it would solve. But, a new path object library that works with unicode internally but stringifies to bytes would have a chance of being useful for working with unicode without breaking too many common assumptions.