Re^2: directories and charsets

I believe this is the shortest expression of what the problem might be:

#!/usr/bin/perl
use strict;
use warnings;
use Encode;

my $topdir;

    if (scalar(@ARGV)) {
        $topdir = shift @ARGV;
    } else {
        print "Enter top dir : ";
        $topdir = <>; chomp $topdir;
    }

    warn("top directory [$topdir] : ",(Encode::is_utf8($topdir) ? '(utf8)' : '(bytes)'));
    unless (opendir(DIR,$topdir)) {
        die ("Could not open it : $!");
    }
    closedir DIR;
    warn "everything ok";
    exit 0;

If you try this on a Windows command line, after creating a directory with a non-ASCII character in the name (say "München", for a change), and run it consecutively as:
perl testutfdir.pl dirname
and
perl testutfdir.pl
you should see the kind of problem I'm having.

This might be the deep cause of my problems, because in the real program I get the name of the top directory of my tree by parsing a parameter file, and the values come to perl as utf8 strings. But the subdirectory names that I read from the disk come in as bytes. Now when I concatenate both to get a full filename, I believe I have a problem.
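To make the suspected concatenation problem concrete, here is a minimal sketch. The directory names are hypothetical, and it assumes the on-disk names arrive as raw UTF-8 bytes, which may not be what readdir actually returns on a given platform. When a character string and a byte string are joined, perl treats each byte as a Latin-1 character, so multi-byte sequences end up as several characters each.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical values: a top directory already decoded into characters
# (as if read from a utf8 parameter file), and a subdirectory name that
# is still raw UTF-8 bytes (assumed form of what comes back from disk).
my $topdir = Encode::decode('UTF-8', "M\xC3\xBCnchen");   # 7 characters
my $subdir = "Stra\xC3\x9Fe";                             # 7 bytes, not decoded

# Mixing the two: the bytes C3 9F are upgraded as two Latin-1 characters.
my $mixed = "$topdir/$subdir";

# Decoding first keeps everything in character semantics.
my $clean = "$topdir/" . Encode::decode('UTF-8', $subdir);

printf "mixed: %d characters\n", length $mixed;   # 15
printf "clean: %d characters\n", length $clean;   # 14
```

The two results differ by one character because the two-byte sequence for "ß" was never decoded, so a path built from the mixed string will generally not match the name on disk once it is encoded again.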

Re^3: directories and charsets by jbert (Priest) on Mar 15, 2007 at 17:59 UTC
Sorry, don't have a Windows perl to hand. I agree that the problem is your subdirectory names coming in as bytes. You need to know their charset, then call Encode::decode to map them from the appropriate charset (probably utf8 or UCS-2) into perl characters.

If you hex dump the bytes and take a look at http://www.fileformat.info/info/unicode/ you should be able to work out what encoding you're getting back from readdir on the different platforms. Then do:

    my $encoding = "xxx"; # Probably 'UTF-8' or 'UTF-16LE' for windows
    my @files = map { Encode::decode($encoding, $_) } readdir DIR;

Your scalars in @files will then be kosher perl unicode strings, and when they are concatenated with the unicode strings you are getting from your parameter file all should be well. Good luck.
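The two steps above (hex dump, then decode) could be sketched roughly like this. The helper names decoded_entries and hex_dump are made up for illustration, and the encoding passed in is an assumption to be confirmed from the hex dump, not something this code can detect.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: render raw bytes as hex so the encoding can be
# identified by eye (e.g. "C3 BC" suggests UTF-8 for "ü", "FC" Latin-1).
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# Hypothetical helper: read a directory and decode every entry name
# from $encoding (which you must determine yourself) into characters.
sub decoded_entries {
    my ($dir, $encoding) = @_;
    opendir(my $dh, $dir) or die "Could not open [$dir] : $!";
    my @raw = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;
    return map { Encode::decode($encoding, $_) } @raw;
}
```

A typical use would be to print hex_dump($_) for each raw readdir entry once, settle on the encoding name, and then call decoded_entries($topdir, $encoding) before concatenating the results with the strings from the parameter file.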
Re^4: directories and charsets by soliplaya (Beadle) on Mar 15, 2007 at 21:18 UTC
Many thanks to all, I believe I am starting to see the heavenly light. It is still at the end of a long tunnel, because what I really want to do in the end is read filenames in a directory which is a few steps away: WWW users (presumably most on Windows workstations) drop files via drag-and-drop onto an HTTP server using DAV. The HTTP/DAV server is a Linux box. My perl script runs on a nearby Windows machine, and sees those Linux directories via a Samba share on the Linux machine. So now all I have to figure out is which character set these filenames are really in under Linux (in other words, what MS Explorer and DAV do to them), how this looks through the Samba share, and how my perl script eventually sees them. But I will bear that chalice happily now that I can see that there is some heavenly principle behind it all.