in reply to Re: utf8 in directory and filenames
in thread utf8 in directory and filenames

Sorry Juerd,
but I cannot open the links you mention at the top of your reply (I'm getting your 404 ever-varying denials, and cannot figure out how to edit the links to make them work).
As to "You can use it to open the file..", that's wrong. The "-f" test fails on such a file entry, and so does the open() (I don't have the specific $! error code yet; I'll have to modify the script for that). If I manually rename the file on the server (e.g. removing the accented character, leaving "Presentacin.ppt"), then both the "-f" test and the open() succeed, without any other change to the script.
Also sorry, I indicated the wrong version of Perl: it is 5.8.3, not 5.8.8, on that server. I did go through the "Changes" sections from 5.8.3 to 5.8.8, though, without finding anything directly on the subject.
The problem occurs in a rather large piece of code; I'll try to make a short testcase version that I can post here.
Thanks for your interest.

Re^3: utf8 in directory and filenames
by Juerd (Abbot) on Nov 13, 2006 at 18:14 UTC

    but I cannot open the links you mention at the top of your reply (I'm getting your 404 ever-varying denials, and cannot figure out how to edit the links to make them work).

    Oops. Used Mediawiki syntax. They're now corrected and tested.

    As to "You can use it to open the file..", that's wrong.

    I'll want to see the error message (and then probably some more info) before I agree. But you can read "You can use it" as "You should be able to use it" in the meantime :)

    Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

      I have read the documents in question, and understand what they say. Really. I have been handling these character set issues for a while, including the Unicode/ISO conversions back and forth in Perl (and iso-8859-x, and "modified utf-7 for IMAP", etc..). I just thought I'd insist on this so that you wouldn't think that I don't understand the basic issues at hand.

      My problem is, specifically, that I had a case where a filename entry that was created by a Windows PC, on a Linux server, and which did contain utf-8 encoded characters, failed the "-f" test and an open() test, and I was, and still am, trying to figure out why.
      What I did learn from you, was that I should apparently not blindly convert my filenames to utf8 (which would have been my first inclination). Thanks.

      However, as far as the main issue goes, I now am close to believing that there are gremlins at play. I am trying to demonstrate the problem I described at the start more clearly, but now, when I create Windows files (with accented characters in the names) and drop them into the DAV directories, they are picked up and read by the Perl script, accents and all, and the filenames seem to be Latin-1, not utf-8 anymore.
      The Windows PC I am dropping the files from, is a Spanish Windows XP station, set up with a Spanish keyboard and all.
      The only thing I changed in my script, I swear, was to add an additional logging message, showing the $! error code in case of the failed open().
      The only explanation I can think of at the moment is that somehow Windows XP, with regard to filenames, has the capability to create them in either the Windows Latin-1 charset or Unicode utf-8 (and encodes this information in its directory entry?). Would you know any Windows guru who could confirm or refute this?
      If that's not true, then I will need to write a little server-side script that forcefully creates several versions of the testfile name (utf8-encoded and not) directly on the Linux server. I'll get busy on that anyway, if only to simplify the issue.
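
      A rough sketch of what such a server-side script might look like. The sub names (make_test_files, check_files) and the sample name are made up for illustration, and it assumes a Linux filesystem that stores filenames as plain bytes, so both variants can coexist in one directory:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Create one empty file per encoding of the same (hypothetical) name in $dir.
# Returns a hash ref: encoding name => the byte string used as the filename.
sub make_test_files {
    my ($dir) = @_;
    my $name = "Presentaci\x{f3}n.ppt";    # \x{f3} is o-acute
    my %bytes = (
        'UTF-8'      => encode('UTF-8',      $name),
        'ISO-8859-1' => encode('ISO-8859-1', $name),
    );
    for my $enc (sort keys %bytes) {
        my $path = "$dir/$bytes{$enc}";
        open my $fh, '>', $path or die "create failed ($enc): $!";
        close $fh;
    }
    return \%bytes;
}

# Run the same two checks the real script uses (-f, then open) on each name.
# Returns a hash ref: encoding name => 1 if both checks passed, else 0.
sub check_files {
    my ($dir, $names) = @_;
    my %ok;
    for my $enc (sort keys %$names) {
        my $path = "$dir/$names->{$enc}";
        $ok{$enc} = (-f $path && open(my $fh, '<', $path)) ? 1 : 0;
        warn "$enc name failed: $!\n" unless $ok{$enc};
    }
    return \%ok;
}
```

      Calling make_test_files on a scratch directory and then check_files on the result should show whether either byte sequence is rejected when the name is passed through untouched.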

      On another note, I am new to this site and do not want to clutter it with the step-by-step resolution of the problem. Suppose I take a break now, and come back to this thread when I have a solid description of what happens. Do I just pick up your last message and hit reply, or start a new question ?

        I have read the documents in question, and understand what they say. Really. I have been handling these character set issues for a while, including the Unicode/ISO conversions back and forth in Perl (and iso-8859-x, and "modified utf-7 for IMAP", etc..). I just thought I'd insist on this so that you wouldn't think that I don't understand the basic issues at hand.

        Do you understand the difference between a Perl unicode string, and a UTF-8 encoded string? That's a bit more complicated than converting between encodings back and forth, and it's the key issue at hand.
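
        To make that distinction concrete, here is a minimal sketch using the core Encode module. The sample name is only an illustration, written with an escape so the source file's own encoding does not matter:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $chars  = "Presentaci\x{f3}n.ppt";         # character string (text)
my $utf8   = encode('UTF-8',      $chars);    # byte string, UTF-8 encoded
my $latin1 = encode('ISO-8859-1', $chars);    # byte string, Latin-1 encoded

# Same name, different lengths: characters versus encoded bytes.
printf "characters: %d, UTF-8 bytes: %d, Latin-1 bytes: %d\n",
    length($chars), length($utf8), length($latin1);

# Decoding the UTF-8 bytes gives back the original character string.
print "round trip ok\n" if decode('UTF-8', $utf8) eq $chars;

# The two byte strings differ, so on a byte-oriented filesystem they
# would name two different files.
print "UTF-8 and Latin-1 names differ\n" if $utf8 ne $latin1;
```

        The lengths come out as 16, 17 and 16: the accented letter is one character, one Latin-1 byte, and two UTF-8 bytes.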

        What I did learn from you, was that I should apparently not blindly convert my filenames to utf8

        Or anything else. A filename, once converted or encoded, is no longer the same filename.

        failed the "-f" test and an open() test, and I was, and still am, trying to figure out why.

        You really, really need to have the error message. If you don't want to output it to STDERR or STDOUT, you can open a log file and write it there. Without the error message, you can only guess what's wrong. Guessing absolutely sucks, because it takes too much time.
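
        Something along these lines would do as a starting point. It is only a sketch: the sub name open_or_log and the log path passed to it are hypothetical, not anything from the script under discussion:

```perl
use strict;
use warnings;

# Try to open $path for reading. On failure, append the reason ($!) to the
# log file $log and return undef; on success, return the filehandle.
sub open_or_log {
    my ($path, $log) = @_;
    if (open my $fh, '<', $path) {
        return $fh;
    }
    my $err = $!;    # save it at once; later system calls may overwrite $!
    open my $logfh, '>>', $log or die "Cannot open log $log: $!";
    # Show non-ASCII bytes as \xNN so the log records the exact name used.
    (my $shown = $path) =~ s/([^\x20-\x7e])/sprintf('\\x%02x', ord $1)/ge;
    print {$logfh} "open failed for '$shown': $err\n";
    close $logfh;
    return;
}
```

        Logging the name in escaped form matters here, because it shows which bytes were actually handed to open(), whatever the terminal or log viewer does with accented characters.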

        I now am close to believing that there are gremlins at play.

        If you're on Linux, use strace(1) to find where the gremlins are.

        Do I just pick up your last message and hit reply, or start a new question ?

        You can continue with the old thread, but it's harder to notice the new message then. I hate to say this, but you're better off starting a new thread. Don't forget to refer to the old one.

        Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }