"where your idea of assuming that "all filenames are UTF-8" breaks, "
Would you stop making false claims about what people say? I made no assertion that "ALL filenames are UTF-8". Do you do that deliberately: quote people slightly out of context to start an argument? Or is it really accidental? I said:
... and now most linux systems are using UTF-8 as native encoding and perl has no mechanism to deal with this? You'd think -CSD would have given it a hint that all Data is to be considered UTF-8 encoded..

That most linux systems use UTF-8 as their native encoding is nowhere close to "all filenames are UTF-8". Now, I could say all HTML5 defaults to UTF-8 encoding, and that would be correct. But filenames on linux are just NUL-terminated strings ("stringZ"s), which means you can put just about any bytes in them. Most distros use UTF-8 these days and, AFAIK, there are no current, *mainstream* Linux filesystems that can't store UTF-8 names. NTFS/Win32 don't count as mainstream linux filesystems, though NTFS itself is far more permissive about file names than Win32, as the NTFS file calls take a length-counted filename (it's the Win32 calls that put the character limits on filenames -- and registry keys).

Sigh.... I think a wrapper around 'find' (-maxdepth 1) might be the easiest -- but I take input from readdir and try to determine whether each entry is a file or a dir, so I pass the names directly to -f/-d, and it doesn't work. You'd think that, with the strings actually having attributes, perl wouldn't run a conversion to LATIN1 on them before -f/-d -- i.e. if they were read in as byte strings, they should be passed as byte strings to -f/-d, and that should work fine. But Perl changes the encoding and does so incorrectly. So yeah, I'd call that the perl unicode bug -- EXTREME!!
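
For what it's worth, here is a minimal sketch of the byte-string approach: keep whatever readdir returns as raw bytes, hand exactly those bytes to -f/-d, and decode only a copy for display. The helper name classify_entries is made up for illustration. It assumes the directory path is itself a byte string and that the names on disk happen to be UTF-8, which nothing guarantees.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not from the thread): classify directory entries
# while keeping each name as the raw bytes readdir returned.
sub classify_entries {
    my ($dir) = @_;    # assumed to be a byte string, not a decoded one
    opendir(my $dh, $dir) or die "opendir $dir: $!";
    my @raw = grep { $_ ne '.' && $_ ne '..' } readdir($dh);
    closedir($dh);

    my @out;
    for my $name (sort @raw) {
        my $path = "$dir/$name";    # still bytes, passed as-is to -d/-f
        my $type = -d $path ? 'dir'
                 : -f $path ? 'file'
                 :            'other';

        # Decode a copy for display only. Assumption: names are UTF-8.
        # Malformed bytes come out as U+FFFD rather than raising an error.
        my $copy  = $name;
        my $label = decode('UTF-8', $copy);

        push @out, { raw => $name, label => $label, type => $type };
    }
    return @out;
}
```

To print the result, set an output layer first, e.g. binmode(STDOUT, ':encoding(UTF-8)'), and print the label field; the raw field is the one to reuse for any further file operations.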

The problem comes down to the 128-255 range, where some perl developers are under the mistaken impression that such characters are UTF-8 compatible as is -- they are not. Every character over 127 takes 2 or more bytes to represent in UTF-8.

There is even crap in the perl documents saying that UTF-8 documents need to have a BOM -- something the Unicode standard neither requires nor recommends (only MS has such requirements).

The fix is simple -- if someone is in a Unicode/UTF-8 locale, then any byte with the high bit set is part of a multi-byte character (2-4 bytes). In fact, EVERY byte of a multi-byte UTF-8 sequence has the high bit set. U+0080 is encoded as 0xc2,0x80, and U+00FF is encoded as 0xc3,0xbf, always read left to right, first byte to last. There is no endian issue with UTF-8, thus no need for a BOM. The program 'file' on linux does a pretty good job (though not perfect) of categorizing something as UTF-8 or ASCII...
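
The two encodings quoted above are easy to check, and the same idea gives a rough ASCII/UTF-8 test. What follows is only a sketch: utf8_hex and guess_encoding are names invented here, and guess_encoding is in the spirit of 'file', not a reimplementation of it.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hex dump of the UTF-8 encoding of a single code point.
sub utf8_hex {
    my ($codepoint) = @_;
    return unpack('H*', encode('UTF-8', chr($codepoint)));
}

# Rough classifier (hypothetical helper):
#   'ascii' if no byte has the high bit set,
#   'utf8'  if the bytes decode strictly as UTF-8,
#   'other' for anything else (e.g. Latin-1 text).
sub guess_encoding {
    my ($bytes) = @_;
    return 'ascii' unless $bytes =~ /[\x80-\xff]/;
    my $copy = $bytes;    # decode with FB_CROAK may modify its argument
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 }
        ? 'utf8'
        : 'other';
}
```

With this, utf8_hex(0x80) gives c280 and utf8_hex(0xff) gives c3bf, matching the byte pairs above, while a Latin-1 string such as "\xe9t\xe9" is rejected as UTF-8 because 0xe9 is not followed by continuation bytes.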

Start with ... as of perl 5.X.0 (X >= 18), perl defaults to treating bytes with the high bit set as already UTF-8 encoded and doesn't "upgrade" them from a provincial locale. To get the old behavior, use "xxxx" (default to locale)...


In reply to Re^4: how to unicode filenames? by perl-diddler
in thread how to unicode filenames? by perl-diddler
