in reply to how to unicode filenames?

File systems and functions are not encoding clean. On Linux and many other unixish systems, the filename string gets passed through "raw" from the filesystem driver, and the receiving userspace application has to decide on the encoding of the filename. See the "Bugs" section of utf8.

Also see unicode version of readdir, directories and charsets

Re^2: how to unicode filenames?
by perl-diddler (Chaplain) on Jun 27, 2012 at 10:04 UTC
    So the problem has been around for 5 years -- and now most linux systems are using UTF-8 as native encoding and perl has no mechanism to deal with this?

    You'd think -CSD would have given it a hint that all Data is to be considered UTF-8 encoded......

    This is highly gross.

      Luckily, Perl is not restricted to Linux.

      Feel free to implement the appropriate layer for Linux - perltodo lists the relevant functions and some thoughts that need to be considered. Especially there are filesystems where your idea of assuming that "all filenames are UTF-8" breaks, mount for example lists the various encodings that a filesystem can provide, and usually these are passed through by the various layers straight to your application.

        "where your idea of assuming that "all filenames are UTF-8" breaks, "
        Would you stop making false claims about what people say? I made no assertion that ALL filenames are UTF-8. Do you do that deliberately: quote people slightly out of context to start an argument? Or is it really accidental? I said:
        ... and now most linux systems are using UTF-8 as native encoding and perl has no mechanism to deal with this? You'd think -CSD would have given it a hint that all Data is to be considered UTF-8 encoded..

        Saying that most Linux systems use UTF-8 as their native encoding is not anywhere even close to saying "all filenames are UTF-8".... Now I could say all HTML5 defaults to UTF-8 encoding, and that would be correct. But filenames on Linux are just stringZ's, which means you can put just about anything in them. Since most distros are using UTF-8 these days, AFAIK there are no current, *mainstream* Linux filesystems that don't support UTF-8. NTFS/Win32 don't count as mainstream Linux filesystems, though NTFS supports any character in a file name (including NULs), as the NTFS file calls take a length-based filename (it's the Win32 calls that put character limits on filenames -- and registry keys...).

        Sigh... I think a wrapper around 'find' (-depth 1) might be the easiest, but I take input from readdir and try to determine if each entry is a file or a dir, so I pass the names directly to -f/-d, and it doesn't work. You'd think with the chars actually having attributes, it wouldn't run a conversion on them to LATIN1 before -f/-d -- i.e. if they were read in as byte strings, they should be passed as byte strings to -f/-d... that should work fine. But Perl changes the encoding and does so incorrectly. So yeah, I'd call that the perl unicode bug -- EXTREME!!...
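
        For what it's worth, here is a minimal sketch of the "bytes in, bytes out" approach described above: the names returned by readdir go to -f/-d untouched, and only a copy is decoded for display. It assumes the names on disk are UTF-8 and that the directory path itself is a plain byte string; classify_entries is a made-up helper name, not anything Perl provides.

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Spec;

# Hypothetical helper for this sketch: list a directory and classify
# each entry, using the raw byte names exactly as readdir returned them.
sub classify_entries {
    my ($dir) = @_;    # assumed to be a byte string, not a decoded one
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my @raw = grep { $_ ne '.' && $_ ne '..' } readdir($dh);
    closedir($dh);

    my @entries;
    for my $bytes (sort @raw) {
        # The file tests see the untouched bytes, never a decoded copy.
        my $path = File::Spec->catfile($dir, $bytes);
        my $type = (-d $path) ? 'dir' : (-f $path) ? 'file' : 'other';

        # Decode a copy for display only (assumption: names are UTF-8).
        my $copy  = $bytes;
        my $label = decode('UTF-8', $copy);

        push @entries, { bytes => $bytes, label => $label, type => $type };
    }
    return @entries;
}
```

        The point of keeping both fields is that {bytes} is what you hand back to open, -f, unlink and friends, while {label} is only for printing through an output layer such as :encoding(UTF-8).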

        The problem comes down to the 128-255 range, where some perl developers are under the mistaken impression that such characters are UTF-8 compatible as is -- they are not. All characters over 127 require 2 or more bytes to represent them in UTF-8.

        There is even crap in the perl documents that UTF-8 documents need to have a BOM -- something that goes against the Unicode standard (only MS has such requirements).

        The fix is simple -- if someone is in a Unicode/UTF-8 locale, then any byte with the high bit set is part of a multi-byte character (2-4 bytes). In fact, ALL bytes of a multi-byte UTF-8 sequence have the high bit set. 0x80 is encoded as 0xc2,0x80, and 0xff is encoded as 0xc3,0xbf, all read left-to-right (low-to-high). There is no endian issue with UTF-8, thus no need for a BOM. The program 'file' on Linux does a pretty good job (though not perfect) of categorizing something as UTF-8 or ASCII...
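
        The byte values quoted above are easy to check with the core Encode module. The sketch below also shows a rough well-formedness test in the spirit of what 'file' does (a strict decode that fails on malformed input; it is not how 'file' is actually implemented). utf8_hex and is_valid_utf8 are names invented for this example.

```perl
use strict;
use warnings;
use Encode qw(encode decode FB_CROAK LEAVE_SRC);

# Return the UTF-8 encoding of a code point as lowercase hex bytes.
sub utf8_hex {
    my ($codepoint) = @_;
    my $bytes = encode('UTF-8', chr($codepoint));
    return join ' ', unpack('(H2)*', $bytes);
}

# True (1) if the byte string is well-formed UTF-8, false (0) otherwise.
# Uses the strict 'UTF-8' encoding; FB_CROAK makes decode die on bad input
# and LEAVE_SRC keeps the caller's string unmodified.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

# Prints:
#   U+007F -> 7f
#   U+0080 -> c2 80
#   U+00FF -> c3 bf
printf "U+%04X -> %s\n", $_, utf8_hex($_) for 0x7f, 0x80, 0xff;
```

        Note that a lone Latin-1 byte such as 0xe9 fails the check, because in UTF-8 it would have to start a three-byte sequence. That is why a name written under a Latin-1 locale can usually be told apart from a UTF-8 one, though pure ASCII passes both ways.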

        Start with ... as of perl 5.X.0 (x>=18), perl defaults to treating high-bit-set bytes as already UTF-8 encoded and doesn't "upgrade" them from a provincial locale. To get the old behavior, use "xxxx" (default to locale)...