soliplaya has asked for the wisdom of the Perl Monks concerning the following question:
This post relates to a previous one, "utf8 in directory and filenames", in which I described an issue with handling utf8 directory entries. Thanks to Juerd and others for their earlier input. Hopefully, this post is more to the point.
This is not really a problem anymore, maybe more of a curiosity. But I think it highlights at least a "sneaky" issue with the way Perl internally handles string "upgrades" to utf-8, and at worst maybe a bug. To see the case, please download and run the following program. Running it in an empty directory makes the output easier to follow.
I guess you also need at least Perl 5.8.1.
The program creates 2 (empty) files named "Presentación.txt", once using the default encoding of your system (assumed to be iso-8859-1), and once using a utf-8 encoding. You can see the difference between the 2 entries by doing an "ls" at the end of the run.
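To make the difference between the two names concrete, here is a small sketch that prints the bytes of each form. It writes the "o acute" as the escape \x{F3} so that the result does not depend on how the script itself is saved. The helper name hex_bytes is made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The file name as a character string; \x{F3} is "o acute" (U+00F3).
my $name = "Presentaci\x{F3}n.txt";

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $bytes;
}

my $iso  = encode('iso-8859-1', $name);   # "o acute" is the single byte f3
my $utf8 = encode('UTF-8',      $name);   # "o acute" is the two bytes c3 b3

printf "iso-8859-1 : %d bytes : %s\n", length($iso),  hex_bytes($iso);
printf "utf-8      : %d bytes : %s\n", length($utf8), hex_bytes($utf8);
```

The first name is 16 bytes long and the second 17, which is why "ls" shows two distinct entries.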
After creating the files, the program reads the directory via opendir() and readdir(), and scans the entries to count the files. It does this twice and compares the results. The curiosity is that, although the directory path is provided with the same content in both cases (.), the results are different.
There is of course a good reason for that, but it is somewhat counter-intuitive if one looks at the screen output.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode;

    # Assume the initial default encoding is iso-8859-1
    # (this script itself must be saved as iso-8859-1 for the test to be valid)
    # We leave STDOUT to the standard encoding too, because we presume it corresponds
    # to your locale and terminal emulation settings

    my $fname_iso = "Presentación.txt"; # fname_iso is bytes
    print " creating file1 [$fname_iso] " . (Encode::is_utf8($fname_iso) ? "(utf8)" : "(bytes)") . "\n";
    open(F1,'>',$fname_iso) or die "cannot open F1 : $!";
    close F1;

    my $fname_utf8 = decode('iso-8859-1',$fname_iso); # fname_utf8 is utf8 and utf8-marked
    # the "o acute" should now be 2 bytes
    print " creating file2 [$fname_utf8] " . (Encode::is_utf8($fname_utf8) ? "(utf8)" : "(bytes)") . "\n";
    open(F2,'>',$fname_utf8) or die "cannot open F2 : $!";
    close F2;

    my $numfiles;
    my $dir_iso = "."; # iso bytes by default
    print "testing with path [$dir_iso] " . (Encode::is_utf8($dir_iso) ? "(utf8)" : "(bytes)") . "\n";
    $numfiles = dirread($dir_iso);
    print "result : $numfiles files found\n";

    my $dir_utf = decode('iso-8859-1',$dir_iso); # force internal utf8
    print "testing with path [$dir_utf] " . (Encode::is_utf8($dir_utf) ? "(utf8)" : "(bytes)") . "\n";
    $numfiles = dirread($dir_utf);
    print "result : $numfiles files found\n";
    exit 0;

    sub dirread {
        my $path = shift;
        my $fullpath;
        my $numfiles = 0;
        opendir(DIR,$path) or die "cannot open directory [$path] : $!";
        my @entries = readdir DIR;
        closedir DIR; # a directory handle needs closedir(), not close()
        foreach (@entries) {
            next if $_ =~ /^\./;
            next if $_ =~ /\.pl$/; # skip myself too
            # should always be 'bytes'
            print "dirread() : checking entry [$_] " . (Encode::is_utf8($_) ? "(utf8)" : "(bytes)") . "\n";
            $fullpath = $path . '/' . $_ ;
            print "dirread() : full path is [$fullpath] " . (Encode::is_utf8($fullpath) ? "(utf8)" : "(bytes)") . "\n";
            if (-f $fullpath) {
                #print " passes the -f test,";
                unless (open(F1,'<',$fullpath) ) {
                    #print " but cannot be opened : $!\n";
                } else {
                    print " ... passes !\n";
                    close F1;
                    $numfiles++;
                }
            } else {
                print " ... fails the -f test !\n";
                unless (open(F1,'<',$fullpath) ) {
                    #print " and fails the open() : $!\n";
                }
            }
        }
        return $numfiles;
    }

The dirread() sub has in principle no way to know in advance the contents of the $path argument, or whether or not Perl considers it to be utf8. Even if it does not contain any character with a Unicode codepoint above 127 (\x7F), it could be internally marked by Perl as utf8, as a result of prior string manipulations.
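As a minimal sketch of that point: the string below is a single dot, yet it carries the utf8 mark. I use the core function utf8::upgrade() here rather than the decode() call from the test program, simply because it sets the mark unconditionally. The helper flag_label is a name made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: the same label the test program prints.
sub flag_label {
    my ($s) = @_;
    return utf8::is_utf8($s) ? "(utf8)" : "(bytes)";
}

my $plain   = ".";
my $flagged = ".";
utf8::upgrade($flagged);   # changes the internal representation, not the content

print "plain   [$plain] ",   flag_label($plain),   "\n";   # (bytes)
print "flagged [$flagged] ", flag_label($flagged), "\n";   # (utf8)
print "equal   : ", ($plain eq $flagged ? "yes" : "no"), "\n";   # yes

# The mark spreads: joining a marked string with a byte string gives a marked result.
my $joined = $flagged . "/abc";
print "joined  [$joined] ", flag_label($joined), "\n";     # (utf8)
```

Nothing in the printed content distinguishes the two paths, so a sub that receives one of them has to call utf8::is_utf8() (or Encode::is_utf8()) to find out.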
The sub also has no way to know in advance the contents of the directory entries. Because of the way readdir() works, the contents will be returned as 'bytes', but these bytes could contain (as they do in the second file) a utf8-encoded filename.
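If one wants to display or compare such entries as text, one possible approach is to guess the encoding. The sketch below is only a heuristic under my own assumptions: it tries a strict UTF-8 decode first and falls back to iso-8859-1 when the bytes are not valid UTF-8. The name decode_entry is hypothetical, and the raw bytes should still be kept for the actual file access.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: turn a raw directory entry into a character string.
# Assumption: anything that is not valid UTF-8 is iso-8859-1.
sub decode_entry {
    my ($bytes) = @_;
    my $copy  = $bytes;   # decode() with a CHECK argument may modify its input
    my $chars = eval { decode('UTF-8', $copy, FB_CROAK) };
    return $chars if defined $chars;
    return decode('iso-8859-1', $bytes);
}

# The two names created by the test program, as readdir() would return them.
my @raw = ("Presentaci\xF3n.txt", "Presentaci\xC3\xB3n.txt");

foreach my $bytes (@raw) {
    my $chars = decode_entry($bytes);
    printf "%d bytes -> %d characters\n", length($bytes), length($chars);
}

print "same name after decoding : ",
    (decode_entry($raw[0]) eq decode_entry($raw[1]) ? "yes" : "no"), "\n";
```

A byte sequence that is valid UTF-8 by coincidence would be misread by this guess, which is the ambiguity the paragraph above describes.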
The interesting part, I find, is that only the last test fails : when the path passed as an argument is marked as utf8 (although it is a single dot), and the directory entry name contains a utf8-encoded sequence. In the case just above, the path is also marked as utf8, but the directory entry does not contain a utf8-encoded name, and the test passes.
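Here is a sketch of what I believe happens in that last case, assuming that Perl hands the string's internal buffer to the system when it accesses a file. When the utf8-marked path is joined with the entry, the entry's bytes are upgraded as if they were iso-8859-1 characters, so the already utf8-encoded "o acute" is encoded a second time. The helpers os_bytes and join_as_bytes are names invented for this example, and join_as_bytes is only one possible workaround.

```perl
use strict;
use warnings;

# Hypothetical helper: hex dump of a string's internal buffer. For a
# utf8-marked string, utf8::encode() on a copy yields exactly those bytes.
sub os_bytes {
    my ($s) = @_;
    utf8::encode($s) if utf8::is_utf8($s);
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $s;
}

# Hypothetical workaround: make the path a byte string before joining, so
# that the entry returned by readdir() is never upgraded.
sub join_as_bytes {
    my ($path, $entry) = @_;
    utf8::encode($path) if utf8::is_utf8($path);
    return $path . '/' . $entry;
}

my $entry = "Presentaci\xC3\xB3n.txt";   # bytes as readdir() returns them
my $dir   = ".";
utf8::upgrade($dir);                     # same content, now marked utf8

my $naive = $dir . '/' . $entry;         # entry upgraded as if iso-8859-1
print "naive : ", os_bytes($naive), "\n";                       # o acute: c3 83 c2 b3
print "fixed : ", os_bytes(join_as_bytes($dir, $entry)), "\n";  # o acute: c3 b3
```

If this reading is right, it would also explain why the iso-8859-1 entry passes with the utf8-marked path: its single byte f3 is upgraded to c3 b3, which happens to be the name of the second file, so the -f test succeeds on a different file from the one readdir() returned.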
These matters are complex and I will wait for comments to see if my test case has some flaw, but for now I believe that this shows at least an inconsistency :