soliplaya has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I am no stranger to the intricacies of character set encodings and conversions between them, but after reading perluniintro, Encode, the open pragma, etc., I am still a bit confused as to what exactly one needs to do in my case:
Imagine a Linux-based Apache2 webserver with DAV, and some directories available for Windows PC users to upload files. These users connect to these directories by means of the Windows-based "web folders" (DAV), and can thus drag-and-drop files in Windows Explorer from their local PC filesystem to the directories located on the server. The users are, for instance, Spanish, and upload files named, for instance, "Presentación.ppt" (notice the accent on the ó).
These files land on the server with their filenames obviously UTF-8 encoded.
On the other hand, these same server directories are "exported" by means of Samba, so that they are visible to a Perl script running on a separate Windows system (Perl v5.8.8). The script opens and reads these directories by means of
opendir(DIR,$dirpath); my @A = readdir DIR; closedir DIR; ... my $entry = shift $A; ...
And the question is : what kind of character encoding will the directory entry "Presentación.ppt" be in, and on what does it depend ?
(From my tests, it would seem that the entry is considered as 'bytes', but these bytes include the 2 bytes that represent the utf-8-encoded character "ó"; in other words, Perl reads the entry with all its bytes correctly, but $entry does not have the "is_utf8" flag set).
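One way to check this kind of observation is to dump each name returned by readdir as hex, together with the state of Perl's internal UTF-8 flag. The following is only a minimal sketch: describe_name is a hypothetical helper that is not part of the original script, and the directory is assumed to come from the command line (defaulting to the current directory).

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): show a string as
# hex code units plus the state of Perl's internal UTF-8 flag.
sub describe_name {
    my ($name) = @_;
    my $hex  = join ' ', map { sprintf '%02X', ord } split //, $name;
    my $flag = utf8::is_utf8($name) ? 'on' : 'off';
    return "$hex (utf8 flag $flag)";
}

# Assumption: the directory is given on the command line, else ".".
my $dirpath = @ARGV ? $ARGV[0] : '.';
opendir(my $dh, $dirpath) or die "Cannot open $dirpath: $!";
for my $entry (readdir $dh) {
    print describe_name($entry), "\n";
}
closedir $dh;
```

For a name such as "Presentación.ppt" stored as UTF-8, the ó would show up as the two bytes C3 B3 with the flag off.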
A secondary wonder is that, when I do a
if (-f "$dirpath/$entry")
the result is false, and if I try to open() the file, it returns an error.
Additional note: files that have no "accented characters" in their names are seen and processed fine. Similarly, if I manually rename the server file to, for instance, "Presentacion.ppt", it is handled fine thereafter (meaning that the -f now returns true, and the open() works).
I would be grateful for any tip. André (with accent)

Re: utf8 in directory and filenames
by Juerd (Abbot) on Nov 13, 2006 at 17:22 UTC

    Please read perlunitut and perluniadvice.

    Filenames live outside Perl, so you need to decode and encode them explicitly (and then hope the bytes are exactly how the file is stored) every time. In other words: for filenames, use byte strings, not unicode text strings.

    A filename that you get from readdir or glob is already a properly encoded byte string. You can use it to open a file, without decoding or encoding the string.
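As a rough illustration of that advice, the sketch below keeps the bytes from readdir untouched for -f and open(), and only decodes a copy for printing. It assumes the names happen to be UTF-8, which Perl cannot verify; display_name is a hypothetical helper, and the directory argument is likewise an assumption.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: make a text copy of a filename for display only.
# Assumption: the bytes are UTF-8; Perl cannot verify that for us.
sub display_name {
    my ($bytes) = @_;
    return decode('UTF-8', $bytes);
}

binmode STDOUT, ':encoding(UTF-8)';    # text strings are encoded on output

# Assumption: the directory is given on the command line, else ".".
my $dirpath = @ARGV ? $ARGV[0] : '.';
opendir(my $dh, $dirpath) or die "Cannot open $dirpath: $!";
my @entries = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
closedir $dh;

for my $bytes (@entries) {
    my $path = "$dirpath/$bytes";      # byte string, used for file access
    my $text = display_name($bytes);   # text string, shown to humans only
    if (-f $path) {
        if (open(my $fh, '<', $path)) {
            print "opened: $text\n";
            close $fh;
        }
        else {
            print "open failed for $text: $!\n";   # report the actual error
        }
    }
    else {
        print "not a plain file: $text\n";
    }
}
```

Printing $! on failure also gives the specific error message, which helps when diagnosing a failing open().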

    And the question is : what kind of character encoding will the directory entry "Presentación.ppt" be in, and on what does it depend?

    The character encoding will depend on the filesystem, and encoding layers used by the implementation of the filesystem, if any. In any case, you cannot be sure in a platform independent way. (Yes, that sucks.)

    Perl reads the entry with all its bytes correctly, but $entry does not have the "is_utf8" flag set.

    That would be wrong. A filename is a binary string, not a text string. It consists of bytes, not characters. In order to use it as a text string, you have to decode it first. But it can be very hard to find out HOW to decode it, and certainly perl can't figure it out for you.

    it returns an error.

    Which error?

    Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

      Sorry Juerd,
      but I cannot open the links you mention at the top of your reply (I'm getting your 404 ever-varying denials, and cannot figure out how to edit the links to make them work).
      As to "You can use it to open the file..", that's wrong. The "-f" test fails on such a file entry, and the open() also (I don't have the specific $! error code, I'll have to modify the script for that). If I manually change the filename on the server (e.g. removing the accented character, leaving e.g. "Presentacin.ppt"), then the "-f" succeeds and the open() also (without changing anything else in the script).
      Also sorry, I indicated the wrong version of Perl : it is 5.8.3, not 5.8.8 on that server. I did go through the "Changes" sections from 5.8.3 to 5.8.8 though, without finding anything directly on the subject.
      The problem occurs in a rather large piece of code; I'll try to make a short testcase version that I can post here.
      Thanks for your interest.

        but I cannot open the links you mention at the top of your reply (I'm getting your 404 ever-varying denials, and cannot figure out how to edit the links to make them work).

        Oops. Used Mediawiki syntax. They're now corrected and tested.

        As to "You can use it to open the file..", that's wrong.

        I'll want to see the error message (and then probably some more info) before I agree. But you can read "You can use it" as "You should be able to use it" in the meantime :)

        Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

Re: utf8 in directory and filenames
by zentara (Cardinal) on Nov 13, 2006 at 16:45 UTC
    Here is a tip I got from graff, see Re: problems with extended ascii characters in filenames

    Summary:

        #this decode utf8 routine is used so filenames with extended
        # ascii characters (unicode) in filenames, will work properly
        use Encode;
        opendir my $dh, $path or warn "Error: $!";
        my @files = grep !/^\.\.?$/, readdir $dh;
        closedir $dh;
        # @files = map{ "$path/".$_ } sort @files;
        #$_ = decode( 'utf8', $_ ) for ( @files );
        @files = map { decode( 'utf8', "$path/".$_ ) } sort @files;

    I'm not really a human, but I play one on earth. Cogito ergo sum a bum

      Note that the result from decode is a text string, and should never be used as a filename. It's good for displaying the filename to human beings, but not for actually opening the file or storing the filename. When that poses a problem, because the filename must be stored in a text document that's actually meant for computers, consider finding a way to encode the bytes to an ASCII-compatible format, like with URI-escaping or quoted printable.
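A minimal sketch of the URI-escaping idea, written without any non-core module: every byte outside a conservative ASCII set becomes %XX, and the escaping can be reversed to recover the exact bytes. escape_name and unescape_name are hypothetical helper names, and the input is assumed to be a byte string (no characters above 0xFF).

```perl
use strict;
use warnings;

# Hypothetical helper: percent-escape every byte outside a conservative
# ASCII set. Assumes $bytes is a byte string (no characters above 0xFF).
sub escape_name {
    my ($bytes) = @_;
    (my $escaped = $bytes) =~ s/([^A-Za-z0-9._~-])/sprintf('%%%02X', ord($1))/ge;
    return $escaped;
}

# Hypothetical helper: reverse the escaping to get the original bytes back.
sub unescape_name {
    my ($escaped) = @_;
    (my $bytes = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $bytes;
}

my $raw    = "Presentaci\xC3\xB3n.ppt";   # UTF-8 bytes as readdir might return them
my $stored = escape_name($raw);
print "$stored\n";                        # prints Presentaci%C3%B3n.ppt
print "round trip ok\n" if unescape_name($stored) eq $raw;
```

The escaped form is plain ASCII, so it can be stored in a text document regardless of that document's encoding, and the original bytes are still available for open().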

      Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

      Thank you so much. I have used your code. It works perfectly.
      Thank you.
      I get the idea about "decoding" the byte string, so that Perl would know that this is to be treated as utf-8.
      What I still do not quite get is why the "-f" and the open() fail when used on the original entry returned by readdir(). Does anyone have an idea?

      (I'm also not quite sure if I'm pushing the right button to send this answer, but I guess I'll find out)
Re: utf8 in directory and filenames
by graff (Chancellor) on Nov 14, 2006 at 02:34 UTC
    I did some playing around, and discovered that I could create a file on my macosx with utf8 characters in the file name (using perl). Of course, to the unix-based OS, such a file name just has some bytes whose high bits are set; their interpretation as characters in a given encoding (and how many octets there may be per character) is of no consequence to the OS or the library calls that handle file access.

    Looking at the OP code snippets, there may have been a typo there, so I'd like to get better clarification. Can you post a minimal, fully-functional and self-contained script that demonstrates your problem with the "-f" operator -- and have it produce some informative output that you can post along with the code? Something like this:

        #!/usr/bin/perl
        use strict;
        binmode STDOUT, ":raw"; # make sure output is not "embellished"
        my $path = ( $ARGV[0] and -d $ARGV[0] ) ? shift : ".";
        opendir( D, $path );
        my @files = grep /[^.]/, readdir( D );
        for my $file ( @files ) {
            my $type = ( -f "$path/$file" ) ? "file" : ( -d _ ) ? "subd" : "othr";
            print "$file ==$type\n";
        }
    If the output of that snippet is piped to a hex dump tool (like "od -txC"), the result would be informative. I tried that on the directory where I created a utf8 file name, and I got what I expected: a two byte sequence in the file name that happens to work as a utf8 accented character. Doing nothing at all to the string returned by readdir(), I was able to get the correct stat info on that file.

    (The apparent typo in the OP was:  $entry = shift $A; -- I hope you have "use strict;" in there somewhere so that, assuming $A is not explicitly declared, this sort of typo would be a compile-time error? Actually, that statement would cause an error anyway, but still...)

    If your linux server is anything like my bsd-based mac, the file names with above-ascii characters should show up as files (unless they happen to be directories), and their names should contain whatever byte sequence was delivered to the server by those Windows systems, no matter what that encoding may have been.

    Update: Okay, as Juerd has been trying to explain, things are a little more dicey, thanks to Perl 5.8's "special treatment" of bytes/utf8-codepoints in the 0x80-0xff range. (The success I reported above was only true when the "wide character" in the file name was above U+00FF.)
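The special treatment of the 0x80-0xFF range can be made visible without touching the filesystem. The sketch below shows that the same three characters can sit in Perl's internal buffer as either three or four bytes. It rests on the assumption, not established in this thread, that the internal buffer is what gets handed to the operating system when the string is used as a filename; internal_length is a hypothetical helper.

```perl
use strict;
use warnings;
use bytes ();   # load bytes::length without changing string semantics

# Hypothetical helper: number of bytes in Perl's internal buffer.
sub internal_length {
    my ($s) = @_;
    return bytes::length($s);
}

my $native   = "l\xE3s";     # three characters in the 0x00-0xFF range
my $upgraded = $native;
utf8::upgrade($upgraded);    # same characters, now stored as UTF-8 internally

print "equal as strings: ", ( $native eq $upgraded ? "yes" : "no" ), "\n";   # yes
print "internal bytes, native:   ", internal_length($native),   "\n";        # 3
print "internal bytes, upgraded: ", internal_length($upgraded), "\n";        # 4
```

Two strings that compare equal can therefore differ in their stored bytes, which is one reason to pin a filename down with an explicit encode() before using it.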

    When I tried the lengthy script you posted in your most recent reply, it broke at line 12 with "Invalid argument" -- for some reason, the string being passed to open() as a file name was not acceptable.

    I found I could create a file name containing bytes in the 80-FF range if I did it like this:

        use Encode;
        $_=encode("utf8","l\x{00e3}s"); # make sure the utf8 flag is off!
        print "creating $_\n";
        open( O,">",$_ ) or die "$_:$!\n";
        print O "test\n";
        close O;
    But strangely -- and unfortunately -- the actual file name that appears in the directory turns out to be five bytes long instead of the expected four: "l a \xCC \x83 s".

    What happened? Perl (or an underlying library?) somehow decided to break the ã character into its two distinct components: the unadorned (ascii) "a", and the unicode "combining tilde" (U+0303, which shows up in utf8 as the two-byte sequence "\xCC \x83"). Why/how is it happening? I don't know yet.

    Well, that's an eye-opener. I wish I knew how to circumvent that sort of behavior, especially since it seems to apply only to bytes/codepoints in the 80-FF range.

      No attempt to explain what happened in your case, just one more data point: I ran your code snippet (the one with the \x{00e3} char) on several Linux boxes and got a filename represented by the 4-byte sequence "l \xC3 \xA3 s" (i.e. the UTF-8 encoding, as expected). So, it doesn't seem to be Perl that's doing the conversion you observe...
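For comparison, the two byte sequences reported in this thread match the UTF-8 encodings of the two Unicode normalization forms of "lãs": the four-byte name is the precomposed form (NFC) and the five-byte name is the decomposed form (NFD). The sketch below only illustrates that difference using the core Unicode::Normalize module; it does not establish which layer performed the decomposition. hex_bytes is a hypothetical helper.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Unicode::Normalize qw(NFC NFD);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

my $name = "l\x{00e3}s";    # "l", a-with-tilde, "s"

print "NFC: ", hex_bytes( encode( 'UTF-8', NFC($name) ) ), "\n";   # NFC: 6C C3 A3 73
print "NFD: ", hex_bytes( encode( 'UTF-8', NFD($name) ) ), "\n";   # NFD: 6C 61 CC 83 73
```

Running the same normalization on a name returned by readdir (after decoding it) is one way to compare names that may have been stored in different forms.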