in reply to Re: directories and charsets
in thread directories and charsets

Yesssss, thank you !

That is exactly the kind of problem I was talking about in my first convoluted message.

From the documentation (perl Unicode etc.) and from my own tests, it seems that readdir() always returns byte strings, i.e. strings that Perl has not internally marked as "utf8". At least, that is what Encode::is_utf8($dir_entry) reports.
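
For reference, this is roughly the kind of test I mean. It is only a sketch: the scratch directory and the file name (the UTF-8 octets for "é.txt") are made up for the example, and what the filesystem does with that name may differ between platforms.

```perl
use strict;
use warnings;
use Encode ();
use File::Temp qw(tempdir);

# Scratch directory holding one file whose name is the UTF-8 octets for "é.txt".
# Both the directory and the name are assumptions made for this test only.
my $dir  = tempdir(CLEANUP => 1);
my $name = "\xC3\xA9.txt";
open(my $fh, '>', "$dir/$name") or die "cannot create file: $!";
close($fh);

opendir(my $dh, $dir) or die "cannot open $dir: $!";
my @entries = grep { $_ ne '.' && $_ ne '..' } readdir($dh);
closedir($dh);

# For each entry, show its raw values in hex, the state of the utf8 flag,
# and its length as Perl sees it.
for my $entry (@entries) {
    printf "%s: utf8 flag %s, length %d\n",
        join(' ', map { sprintf '%02X', ord } split //, $entry),
        (Encode::is_utf8($entry) ? 'on' : 'off'),
        length($entry);
}
```

If the flag shows as "off", then length() is counting bytes, not characters.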

However, at some point, after concatenating that directory entry with, for instance, the path of the directory it came from, a test such as "if (-f $fullpath)" returns false.
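
One possible explanation, which I have not confirmed for my own case, is that the directory path is a utf8-flagged string while the entry is plain bytes, so the concatenation changes the entry. The sketch below tries to reproduce that. byte_path() is a hypothetical helper of mine, not something from a module, and it assumes the filesystem names are UTF-8.

```perl
use strict;
use warnings;
use Encode ();
use File::Temp qw(tempdir);

# Hypothetical helper: join a directory and a readdir() entry as raw bytes.
# Assumption: the filesystem uses UTF-8 names.
sub byte_path {
    my ($dir, $entry) = @_;
    # Encode any utf8-flagged part back to UTF-8 octets first, so the
    # concatenation cannot trigger an implicit upgrade of the other part.
    for my $part ($dir, $entry) {
        $part = Encode::encode('UTF-8', $part) if Encode::is_utf8($part);
    }
    return "$dir/$entry";
}

my $dir   = tempdir(CLEANUP => 1);
my $entry = "\xC3\xA9.txt";    # UTF-8 octets for "é.txt", as readdir() returns them
open(my $fh, '>', "$dir/$entry") or die "cannot create file: $!";
close($fh);

# Simulate a directory path that reached us as a character string.
my $dir_chars = Encode::decode('UTF-8', $dir);
utf8::upgrade($dir_chars);     # make sure the utf8 flag is set, even for ASCII

my $naive = "$dir_chars/$entry";            # $entry gets upgraded here
my $safe  = byte_path($dir_chars, $entry);  # everything stays as bytes

print "naive: ", (-f $naive ? "found" : "not found"), "\n";
print "safe:  ", (-f $safe  ? "found" : "not found"), "\n";
```

The point of the helper is just to keep everything that goes to the filesystem in one representation (bytes) before testing with -f.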

I am now testing on a Windows machine, and I thought that Windows NTFS stored filenames as UTF-8. But you seem to say that this is not so, and that it uses UCS-2 instead. That might explain why I keep getting errors when trying various permutations of encoding and decoding my filenames.

Back to testing thus, with this exciting new possibility..

Re^3: directories and charsets
by jbert (Priest) on Mar 15, 2007 at 16:46 UTC
    If you concatenate a utf8-tagged string with a non-utf8-tagged string, perl will silently "upgrade" the non-utf8 string to utf8, converting it on the assumption that its bytes are latin1 (that default might be changeable with locale, the PERL_ENCODING env variable or similar).

    There is a module to warn you when this happens (can't remember what it's called though).

    If such an untagged string already contains utf8 byte sequences, this will give you an incorrect double-encoding of the string.
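
    A minimal sketch of that double-encoding, assuming the default latin1 upgrade. The two strings are made up for the illustration:

    ```perl
    use strict;
    use warnings;
    use Encode ();

    my $bytes = "\xC3\xA9";     # UTF-8 octets for "é", utf8 flag off
    my $chars = "\x{263A}";     # a character string, utf8 flag on

    # The concatenation upgrades $bytes as if it were latin1, so the two
    # octets become the two characters U+00C3 and U+00A9.
    my $joined = $chars . $bytes;

    # Encoding the result back to UTF-8 shows the damage:
    # the original C3 A9 has turned into C3 83 C2 A9.
    my $out = Encode::encode('UTF-8', $joined);
    print join(' ', map { sprintf '%02X', ord } split //, $out), "\n";
    # prints: E2 98 BA C3 83 C2 A9
    ```

    Note that $joined has length 3, not 2: the "é" has become two separate characters.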

    It seems to me that one way to get the right behaviour is to do:

    # _utf8_on() returns the previous flag state, not the string, so return $_ explicitly
    my @files = map { Encode::_utf8_on($_); $_ } readdir DIRHANDLE;
    when reading names from a utf8-named-filesystem.
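
    A somewhat more defensive variant of the same idea, as a sketch only: decode the names rather than just flipping the flag, so that a malformed name dies instead of producing a broken string. The scratch directory and file name are made up for the example, and it still assumes the filesystem really uses utf8 names.

    ```perl
    use strict;
    use warnings;
    use Encode ();
    use File::Temp qw(tempdir);

    # Scratch directory with one file named with the UTF-8 octets for "é.txt".
    my $dir = tempdir(CLEANUP => 1);
    open(my $fh, '>', "$dir/\xC3\xA9.txt") or die "cannot create file: $!";
    close($fh);

    opendir(my $dh, $dir) or die "cannot open $dir: $!";
    my @raw = grep { !/^\.\.?$/ } readdir($dh);
    closedir($dh);

    # FB_CROAK dies on malformed UTF-8; LEAVE_SRC keeps @raw untouched.
    my @names = map {
        Encode::decode('UTF-8', $_, Encode::FB_CROAK() | Encode::LEAVE_SRC())
    } @raw;

    # Encode again before handing a name back to the filesystem.
    for my $name (@names) {
        my $path = "$dir/" . Encode::encode('UTF-8', $name);
        print "regular file\n" if -f $path;
    }
    ```

    The decoded names are for use inside the program. Anything passed back to open(), -f and friends gets encoded to bytes again first.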

    I could be wrong on the NTFS thing. It's just that UCS-2 (UTF-16-a-like) is *very* entrenched on Windows, and I'd be very surprised if NTFS wasn't using it as its native format. (Of course, you may well see utf8 when you mount the share with smbfs. I'd expect smbfs to do that translation for you, but maybe it's a mount option or something.)
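
    To see why guessing the wrong one of the two gives garbage, here is a small sketch that prints the same name in both encodings. The name "été" is just an example, and this says nothing about what NTFS actually stores:

    ```perl
    use strict;
    use warnings;
    use Encode ();

    my $name = "\x{E9}t\x{E9}";    # "été" as a character string

    # Encode the same characters both ways and dump the octets in hex.
    my %octets;
    for my $enc ('UTF-8', 'UTF-16LE') {
        $octets{$enc} = Encode::encode($enc, $name);
        printf "%-8s %s\n", $enc,
            join(' ', map { sprintf '%02X', ord } split //, $octets{$enc});
    }
    # prints:
    # UTF-8    C3 A9 74 C3 A9
    # UTF-16LE E9 00 74 00 E9 00
    ```

    The byte sequences have nothing in common beyond the ASCII "t", so decoding one as the other cannot work.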