dir/filenames and utf8

soliplaya has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

This post follows up on a previous one, "utf8 in directory and filenames", in which I described an issue with handling utf8 directory entries. Thanks to Juerd and others for their earlier input. Hopefully, this post is more to the point.

This is not really a problem anymore, maybe more of a curiosity. But I think it highlights at least a "sneaky" issue with the way Perl internally handles string "upgrades" to utf-8, and at worst maybe a bug. To see the case, please download and run the following program. Running it in an empty directory makes it easier to follow.
I guess you also need Perl 5.8.1 minimum.

The program creates 2 (empty) files named "Presentación.txt", once using the default encoding of your system (assumed to be iso-8859-1), and once using a utf-8 encoding. You can see the difference between the 2 entries by doing an "ls" at the end of the run.
After creating the files, the program reads the directory via opendir() and readdir(), and scans the entries to count the files. It does this twice and compares the results. The curiosity is that, although the directory path is provided with the same content in both cases (.), the results are different.
There is of course a good reason for that, but it is somewhat counter-intuitive if one looks at the screen output.

#!/usr/bin/perl
use strict;
use warnings;
use Encode;

# Assume the initial default encoding is iso-8859-1
# We leave STDOUT to the standard encoding too, because we presume it
# corresponds to your locale and terminal emulation settings

my $fname_iso = "Presentación.txt"; # fname_iso is bytes
print "  creating file1 [$fname_iso] " . (Encode::is_utf8($fname_iso) ? "(utf8)" : "(bytes)") . "\n";
open(F1,'>',$fname_iso) or die "cannot open F1 : $!";
close F1;

my $fname_utf8 = decode('iso-8859-1',$fname_iso); # fname_utf8 is utf8 and utf8-marked
# the "o acute" should now be 2 bytes
print "  creating file2 [$fname_utf8] " . (Encode::is_utf8($fname_utf8) ? "(utf8)" : "(bytes)") . "\n";
open(F2,'>',$fname_utf8) or die "cannot open F2 : $!";
close F2;

my $numfiles;

    my $dir_iso = "."; # iso bytes by default
    print "testing with path [$dir_iso] " . (Encode::is_utf8($dir_iso) ? "(utf8)" : "(bytes)") . "\n";
    $numfiles = dirread($dir_iso);
    print "result : $numfiles files found\n";

    my $dir_utf = decode('iso-8859-1',$dir_iso); # force internal utf8
    print "testing with path [$dir_utf] " . (Encode::is_utf8($dir_utf) ? "(utf8)" : "(bytes)") . "\n";
    $numfiles = dirread($dir_utf);
    print "result : $numfiles files found\n";

    exit 0;

sub dirread {
    my $path = shift;
    my $fullpath;
    my $numfiles = 0;
    opendir(DIR,$path) or die "cannot opendir [$path] : $!";
    my @entries = readdir DIR;
    closedir DIR; # a directory handle needs closedir(), not close()
    foreach (@entries) {
        next if $_ =~ /^\./;
        next if $_ =~ /\.pl$/; # skip myself too
        print "dirread() : checking entry [$_] " . (Encode::is_utf8($_) ? "(utf8)" : "(bytes)") . "\n"; # should always be 'bytes'
        $fullpath = $path . '/' . $_ ;
        print "dirread() : full path is [$fullpath] " . (Encode::is_utf8($fullpath) ? "(utf8)" : "(bytes)") . "\n";
        if (-f $fullpath) {
            #print "    passes the -f test,";
            unless (open(F1,'<',$fullpath) ) {
                #print "  but cannot be opened : $!\n";
            } else {
                print "   ... passes !\n";
                close F1;
                $numfiles++;
            }
        } else {
            print "    ... fails the -f test !\n";
            unless (open(F1,'<',$fullpath) ) {
                #print "  and fails the open() : $!\n";
            }

        }

    }
    return $numfiles;
}

Comments:

The dirread() sub has in principle no way to know in advance the contents of the $path argument, or whether Perl considers it to be utf8. Even if it contains no characters with a Unicode codepoint above 127 (\x7F), it could be internally marked by Perl as utf8 as a result of prior string manipulations.
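
As a minimal sketch of that last point (not part of the test program; it relies only on the core utf8::upgrade() and utf8::is_utf8() functions, and the variable names are made up for illustration), a single ASCII dot can carry the utf8 mark while still comparing equal to its unmarked twin:

```perl
use strict;
use warnings;

my $dot_bytes  = ".";          # plain byte string, no utf8 mark
my $dot_marked = ".";
utf8::upgrade($dot_marked);    # same content, internal utf8 mark now set

# Compare the content and the internal mark of both strings
my $same_content = ($dot_bytes eq $dot_marked) ? 1 : 0;
my $bytes_flag   = utf8::is_utf8($dot_bytes)   ? 1 : 0;
my $marked_flag  = utf8::is_utf8($dot_marked)  ? 1 : 0;

print "same content                : $same_content\n";
print "plain dot marked as utf8    : $bytes_flag\n";
print "upgraded dot marked as utf8 : $marked_flag\n";
```

So a sub that only looks at the content of its argument cannot tell the two apart; it has to ask for the mark explicitly.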

The sub also has no way to know in advance the contents of the directory entries. Because of the way readdir() works, the contents will be returned as 'bytes', but these bytes could contain (as they do in the second file) a utf8-encoded filename.
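
The sub could at least guess. The sketch below is only a heuristic under my own assumptions: looks_like_utf8() is a hypothetical helper, not something from the test program, and it simply tries a strict decode of the octets with Encode. The two names are hard-coded as the octets readdir() would be expected to return for the two files.

```perl
use strict;
use warnings;
use Encode;

# Hypothetical helper: returns 1 if the octets form a valid UTF-8 sequence.
# LEAVE_SRC keeps decode() from modifying the argument.
sub looks_like_utf8 {
    my ($octets) = @_;
    my $ok = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? 1 : 0;
}

my $name_iso  = "Presentaci\xF3n.txt";      # "o acute" as one iso-8859-1 byte
my $name_utf8 = "Presentaci\xC3\xB3n.txt";  # "o acute" as two utf-8 octets

my $iso_guess  = looks_like_utf8($name_iso);
my $utf8_guess = looks_like_utf8($name_utf8);

print "iso name looks like utf8  : $iso_guess\n";
print "utf8 name looks like utf8 : $utf8_guess\n";
```

Note that a pure ASCII name passes this check too, and that a valid-looking sequence is still no proof of the encoding the file was created with.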

The interesting part, I find, is that only the last test fails : when the path passed as an argument is marked as utf8 (although it is a single dot), and the directory entry name contains a utf8-encoded sequence. In the case just above, the path is also marked as utf8, but the directory entry does not contain a utf8-encoded name, and the test passes.

These matters are complex and I will wait for comments to see if my testcase has some flaw, but as for now I believe that this shows at least an inconsistency :
In the second call to the sub, the directory path is marked as utf8, but it contains a single byte anyway (a dot, which has the same encoding in utf8 as in iso-8859-1). This directory path is then concatenated with a directory entry, which in both cases is a string of 'bytes' that is not utf8-marked. Because the path element is marked as utf8, the concatenated value should, in both cases, itself be marked as utf8 (as far as I understand, Perl would not "downgrade" a string). Thus, when Perl takes the directory entry read via readdir() (which is bytes) and concatenates it with the path, it should either convert the byte string to utf8, or leave it as is. From the screen output, it appears that it leaves it as is (which seems sensible to me, because Perl has no way to know the encoding of the directory entry).
Then, when using the fullpath as an argument to the "-f" test, Perl should either leave the content of fullpath as is, or (noticing that it is utf8), reconvert it to 'bytes' prior to the test. I assume again that it leaves it "as is", because doing otherwise might create a real mess.
But then, why does the last test fail ?
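
One way to check the "as is" assumption is to look at the octets Perl holds after the concatenation, rather than at the screen output. The sketch below is an illustration under stated assumptions, not a diagnosis: the entry name is hard-coded as the octets readdir() would be expected to return for the utf-8 named file, and utf8::encode() is applied to a copy only to expose the internal representation. When a byte string is concatenated with a utf8-marked one, Perl upgrades the bytes as if they were iso-8859-1 characters, so each of the two octets of the "o acute" is re-encoded in turn:

```perl
use strict;
use warnings;

# Octets of the utf-8 named entry: the "o acute" is C3 B3
my $entry = "Presentaci\xC3\xB3n.txt";

my $dir = ".";
utf8::upgrade($dir);                  # single dot, but utf8-marked

my $fullpath = $dir . '/' . $entry;   # the entry bytes get upgraded here

# Copy the string and turn it into the octets of its utf8 representation
my $internal = $fullpath;
utf8::encode($internal);

# Hypothetical helper: hex dump of a byte string
sub hexdump { return join ' ', map { sprintf '%02X', ord } split //, $_[0] }

print "fullpath marked as utf8 : ", (utf8::is_utf8($fullpath) ? "yes" : "no"), "\n";
print "entry octets            : ", hexdump($entry), "\n";
print "fullpath internal octets: ", hexdump($internal), "\n";
```

The internal form contains C3 83 C2 B3 where the entry had C3 B3. Printed to a STDOUT without an encoding layer, the string still comes out as the original bytes, which is why the screen output suggests nothing changed. If the -f test hands the internal form to the operating system unchanged (an assumption I have not verified here, and it may depend on the Perl version), that name would match neither file, whereas the upgraded form of the iso-8859-1 entry would happen to coincide with the utf-8 named file.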
