in reply to utf8 in directory and filenames

Please read perlunitut and perluniadvice.

Filenames live outside Perl, so Perl does not decode or encode them for you. If you want to treat a filename as text, you have to decode and encode it explicitly every time (and then hope the bytes you produce are exactly how the name is stored). In other words: for filenames, use byte strings, not Unicode text strings.

A filename that you get from readdir or glob is already a properly encoded byte string. You can use it to open a file, without decoding or encoding the string.
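As a minimal sketch of that point, the script below creates a file whose name is a raw byte string, reads it back with readdir, and reuses the entry unchanged for "-f" and open(). The scratch directory and the sample name are assumptions made for illustration, and it presumes a filesystem that stores name bytes as given (typical on Linux); on a failure it prints $!, which is the information needed to diagnose a case like the one described here.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical scratch directory so the sketch can run anywhere.
my $dir = tempdir(CLEANUP => 1);

# "Presentación.ppt" written out as UTF-8 bytes (an assumed encoding).
my $name_bytes = "Presentaci\xC3\xB3n.ppt";
open(my $out, '>', File::Spec->catfile($dir, $name_bytes))
    or die "create failed: $!";
close $out;

# Read the directory and reuse each entry exactly as returned.
opendir(my $dh, $dir) or die "opendir failed: $!";
my @entries = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
closedir $dh;

my @opened;
for my $entry (@entries) {
    my $path = File::Spec->catfile($dir, $entry);
    if (-f $path && open(my $in, '<', $path)) {
        push @opened, $entry;   # no decode or encode step was needed
        close $in;
    }
    else {
        warn "cannot open '$path': $!\n";
    }
}
```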

And the question is: what kind of character encoding will the directory entry "Presentación.ppt" be in, and on what does it depend?

The character encoding will depend on the filesystem, and encoding layers used by the implementation of the filesystem, if any. In any case, you cannot be sure in a platform independent way. (Yes, that sucks.)

Perl reads the entry with all its bytes correctly, but $entry does not have the "is_utf8" flag set.

That would be wrong. A filename is a binary string, not a text string. It consists of bytes, not characters. In order to use it as a text string, you have to decode it first. But it can be very hard to find out HOW to decode it, and certainly perl can't figure it out for you.
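To illustrate what "decode it first" might look like, here is a hedged sketch that assumes the name is UTF-8 and decodes it strictly, so a name stored in some other encoding (Latin-1, say) is rejected instead of silently mangled. The helper name try_decode_utf8 is hypothetical; the guess about the encoding is still yours to make, since Perl cannot make it for you.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: return the decoded text string, or undef if the
# bytes are not valid UTF-8. FB_CROAK makes decode() die on malformed
# input; LEAVE_SRC keeps decode() from modifying its argument.
sub try_decode_utf8 {
    my ($bytes) = @_;
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text;
}

# Bytes as readdir might return them (UTF-8 is an assumption here).
my $name_bytes = "Presentaci\xC3\xB3n.ppt";
my $name_text  = try_decode_utf8($name_bytes);

if (defined $name_text) {
    # Use $name_text for display, sorting, or matching, but keep
    # $name_bytes for open(), -f, and other filesystem calls.
    printf "%d characters, %d bytes\n",
        length($name_text), length($name_bytes);
}
else {
    warn "name is not valid UTF-8; leaving it as bytes\n";
}
```

Keeping the original bytes next to the decoded text avoids having to re-encode the name later and hope the result matches what is stored on disk.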

it returns an error.

Which error?

Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

Re^2: utf8 in directory and filenames
by soliplaya (Beadle) on Nov 13, 2006 at 17:58 UTC
    Sorry Juerd,
    but I cannot open the links you mention at the top of your reply (I'm getting your 404 ever-varying denials, and cannot figure out how to edit the links to make them work).
    As to "You can use it to open the file..", that's wrong. The "-f" test fails on such a directory entry, and so does open() (I don't have the specific $! error yet; I'll have to modify the script to log it). If I manually rename the file on the server to remove the accented character (leaving e.g. "Presentacin.ppt"), then both "-f" and open() succeed, without any other change to the script.
    Also, sorry: I gave the wrong Perl version. That server runs 5.8.3, not 5.8.8. I did go through the "Changes" sections from 5.8.3 to 5.8.8, though, without finding anything directly on the subject.
    The problem occurs in a rather large piece of code; I'll try to make a short testcase version that I can post here.
    Thanks for your interest.

      but I cannot open the links you mention at the top of your reply (I'm getting your 404 ever-varying denials, and cannot figure out how to edit the links to make them work).

      Oops. Used Mediawiki syntax. They're now corrected and tested.

      As to "You can use it to open the file..", that's wrong.

      I'll want to see the error message (and then probably some more info) before I agree. But you can read "You can use it" as "You should be able to use it" in the meantime :)

      Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

        I have read the documents in question, and understand what they say. Really. I have been handling these character set issues for a while, including the Unicode/ISO conversions back and forth in Perl (and iso-8859-x, and "modified utf-7 for IMAP", etc..). I just thought I'd insist on this so that you wouldn't think that I don't understand the basic issues at hand.

        My problem is, specifically, that a directory entry created by a Windows PC on a Linux server, whose name did contain utf-8 encoded characters, failed both the "-f" test and open(), and I was, and still am, trying to figure out why.
        What I did learn from you, was that I should apparently not blindly convert my filenames to utf8 (which would have been my first inclination). Thanks.

        However, as far as the main issue goes, I am now close to believing that there are gremlins at play. I am trying to demonstrate the problem I described at the start more clearly, but now, when I create Windows files with accented characters in their names and drop them into the DAV directories, the Perl script picks them up and reads them, accents and all, and the filenames seem to be Latin-1, not utf-8 anymore.
        The Windows PC I am dropping the files from, is a Spanish Windows XP station, set up with a Spanish keyboard and all.
        The only thing I changed in my script, I swear, was to add an additional logging message, showing the $! error code in case of the failed open().
        The only explanation I can think of at the moment is that Windows XP can somehow create filenames in either the Windows Latin-1 charset or Unicode utf-8 (and records which one in its directory entry?). Would you know any Windows guru who could confirm or refute this?
        If that's not true, then I will need to write a little server-side script that forcefully creates several versions of the testfile name (utf8-encoded and not) directly on the Linux server. I'll get busy on that anyway, if only to simplify the issue.

        On another note, I am new to this site and do not want to encumber it with a step-by-step resolution of the problem. Suppose I take a break now and come back to this thread when I have a solid description of what happens. Do I just pick up your last message and hit reply, or start a new question?