in reply to directories and charsets

I don't know about the problems you're having with network filesystems, other than the fact that it is the job of the filesystem to convert as necessary. Windows NTFS is going to be using 2-byte UCS-2 (a close relative of UTF-16) to store the filenames on disk, but Linux generally uses utf8 filenames.

That, however, is for smbfs and Samba to sort out.

You should just be able to read and write utf8 filenames, as in the code below; however, I get failures for tests #13 and #15. This is presumably because the filenames returned from 'glob' and 'readdir' *don't* have the utf8 flag on.

Do any monks have some more info on this? If I read a filename from a utf8 filesystem, should the filename have the utf8 flag on? (ASCII-exception permitting, of course).

perl 5.8.8

#!/usr/bin/perl
use strict;
use warnings;
use Test::More tests => 16;    # 8 tests per call, called twice
use Encode;

binmode STDOUT, ':utf8';       # If you have a UTF-8 terminal

my $workdir = "./tt";
mkdir $workdir;                # Let it fail if it already exists

# This is a byte sequence, not tagged as utf8 to perl,
# so theoretically perl should consider it to be in the local
# encoding, normally latin1
my $place = "M\xc3\xbcnchen";
test_placename($workdir, $place);

# Turn on the flag for this scalar. Since we pre-arranged for
# the byte sequence of this scalar to contain valid utf8, this
# scalar is now a valid perl unicode string.
Encode::_utf8_on($place);
test_placename($workdir, $place);

exit 0;

sub test_placename {
    my $workdir = shift;
    my $place   = shift;
    my $fname   = "$workdir/$place";
    my $fh;

    ok(!-f $fname, "$fname doesn't already exist");

    open($fh, ">", $fname) or die "Can't create $fname : $!";
    close $fh;
    ok(1, "can create $fname with 'open'/close");
    ok(-f $fname, "can find $fname with -f");

    my @files = glob("$workdir/$place");
    is(scalar @files, 1, "One file in dir via glob");
    is($files[0], $fname, "and it's what we expect");

    my $dh;
    opendir $dh, $workdir or die "Can't open $workdir : $!";
    @files = grep { !/^\./ } readdir $dh;
    closedir $dh;
    is(scalar @files, 1, "One file in dir via readdir");
    is($files[0], $place, "and it's what we expect");

    my $num_files_unlinked = unlink($fname);
    is($num_files_unlinked, 1, "can remove $fname");
}

Re^2: directories and charsets
by soliplaya (Beadle) on Mar 15, 2007 at 16:56 UTC
    I believe this is the shortest expression of what the problem might be:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode;

    my $topdir;
    if (scalar(@ARGV)) {
        $topdir = shift @ARGV;
    }
    else {
        print "Enter top dir : ";
        $topdir = <>;
        chomp $topdir;
    }
    warn("top directory [$topdir] : ",
         (Encode::is_utf8($topdir) ? '(utf8)' : '(bytes)'));

    unless (opendir(DIR, $topdir)) {
        die("Could not open it : $!");
    }
    closedir DIR;
    warn "everything ok";
    exit 0;
    If you try this from a Windows command line, after creating a directory with a non-ASCII character in its name (say "München", for a change), and run it consecutively as:
    perl testutfdir.pl dirname
    and
    perl testutfdir.pl
    you should see the kind of problem I'm having.

    This might be the deep cause of my problems: in the real program, I get the name of the top directory of my tree by parsing a parameter file, and those names come to perl as utf8 strings. But the subdirectory names that I read from the disk come in as bytes. When I concatenate the two to get a full filename, I believe I have a problem.

      Sorry, don't have a windows perl to hand.

      I agree that the problem is your subdirectory names coming in as bytes. You need to know their charset, then call Encode::decode to map them from the appropriate charset (probably utf8 or UCS-2) into perl characters.

      If you hex dump the bytes and take a look at http://www.fileformat.info/info/unicode/ you should be able to work out what encoding you're getting back from readdir on the different platforms. Then do:

      my $encoding = "xxx";   # Probably 'UTF-8' or 'UTF-16LE' for windows
      my @files = map { Encode::decode($encoding, $_) } readdir DIR;
      Your scalars in @files will then be kosher perl unicode strings, and when they are concatenated with the unicode strings you are getting from your parameter file all should be well.

      Good luck.

        Many thanks to all, I believe I am starting to see the heavenly light.
        It is still at the end of a long tunnel, because what I really want to do in the end is read filenames in a directory that is a few steps away:
        WWW users (presumably mostly on Windows workstations) drop files via drag-and-drop onto an HTTP server using DAV. The HTTP/DAV server is a Linux box. My perl script runs on a nearby Windows machine and sees those same Linux directories via a Samba share on the Linux machine.
        So now all I have to figure out is which character set these filenames really are in under Linux (iow, what MS Explorer and DAV do to them), how this looks through the Samba share, and how my perl script eventually sees them.
        But I will bear that chalice happily now that I can see that there is some heavenly principle behind it all.
Re^2: directories and charsets
by soliplaya (Beadle) on Mar 15, 2007 at 16:04 UTC
    Yesssss, thank you !

    That is exactly the kind of problem I was talking about in my first convoluted message.

    From the documentation (perlunicode etc.) and from my personal tests, it would seem that readdir() always returns strings that are "bytes" (not internally marked as "utf8" by Perl), as reported by Encode::is_utf8($dir_entry).
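    A minimal sketch of that check (it just lists the current directory and reports each entry's flag; the formatting is illustrative):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode;

# List every entry in the current directory and report whether perl
# has the internal utf8 flag set on it. readdir returns plain bytes.
opendir my $dh, '.' or die "Can't open . : $!";
for my $entry (readdir $dh) {
    printf "%-30s %s\n", $entry,
        Encode::is_utf8($entry) ? '(utf8)' : '(bytes)';
}
closedir $dh;
```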

    However, it seems that after concatenating that directory entry with, for instance, the directory path it came from, a test like "if (-f $fullpath)" answers false.

    I was now testing on a Windows machine, and I thought that Windows NTFS was storing filenames as UTF-8. But you seem to say that this is not true, and that it is UCS-2 instead. That might explain why, when trying various permutations and encodings or decodings of my filenames, I am getting errors.

    Back to testing, then, with this exciting new possibility.

      If you concatenate a utf8-tagged string with a non-utf8-tagged string, perl will silently "upgrade" the non-utf8 string to utf8, converting it on the assumption that its bytes are in the native eight-bit encoding (normally latin1).

      There is a module to warn you when this happens (can't remember what it's called though).

      If such an untagged string already contains utf8 byte sequences, this will give you an incorrect double-encoding of the string.
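      A small sketch of that failure mode. The byte values here are just the UTF-8 encodings of "München" and "café", chosen for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Bytes as readdir might return them: UTF-8 for "München", utf8 flag OFF.
my $dir_entry = "M\xc3\xbcnchen";                       # 8 bytes

# A tagged perl unicode string, e.g. from a decoded parameter file.
my $topdir = Encode::decode('UTF-8', "caf\xc3\xa9");    # "café", 4 chars

# Concatenation silently upgrades $dir_entry as if it were latin1, so
# the byte pair \xc3\xbc becomes the two characters "Ã¼" instead of
# the single character "ü": a double encoding.
my $broken = "$topdir/$dir_entry";
print length($broken), "\n";    # 13 - "café/MÃ¼nchen"

# Decoding the bytes first gives the intended string.
my $fixed = "$topdir/" . Encode::decode('UTF-8', $dir_entry);
print length($fixed), "\n";     # 12 - "café/München"
```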

      It seems to me that one way to get the right behaviour is to do:

      my @files = map { Encode::_utf8_on($_); $_ } readdir DIRHANDLE;
      when reading names from a utf8-named filesystem. (Note the trailing $_ so that map returns the string itself rather than _utf8_on's return value.)
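      A more defensive variant of the same idea uses Encode::decode with FB_CROAK, so a name that isn't actually valid UTF-8 fails loudly instead of being mis-tagged. This sketch builds its own throwaway fixture (directory name and file name are placeholders, and it assumes a filesystem that stores names as raw bytes, e.g. Linux):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Throwaway fixture: one file whose on-disk name is the raw UTF-8
# byte sequence for "München".
my $dir = './utf8demo';
mkdir $dir or die "Can't mkdir $dir : $!";
open my $fh, '>', "$dir/M\xc3\xbcnchen" or die "Can't create file : $!";
close $fh;

opendir my $dh, $dir or die "Can't open $dir : $!";
my @files = map {
    my $bytes = $_;    # copy: decode with a CHECK argument edits in place
    Encode::decode('UTF-8', $bytes, Encode::FB_CROAK)   # dies on bad UTF-8
} grep { !/^\.\.?$/ } readdir $dh;
closedir $dh;

print scalar(@files), "\n";       # 1
print length($files[0]), "\n";    # 7 characters: M ü n c h e n

unlink "$dir/M\xc3\xbcnchen";
rmdir $dir;
```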

      I could be wrong on the NTFS thing; it's just that UCS-2 (a UTF-16 look-alike) is *very* entrenched on Windows, and I'd be very surprised if NTFS weren't using it as its native format. (Of course, you may well see the names as utf8 when you mount the share with smbfs; I'd expect smbfs to do that translation for you, but maybe it's a mount option or something.)