That's exactly why I've written my post :) use locale breaks the Unicode.

Oh, this is quite well-known: POSIX locales are for bad old legacy scripts that aren’t Unicode-aware. Such scripts rely on some 8-bit encoding of binary bytes, interpreted through the locale’s LC_CTYPE and/or LC_COLLATE values, rather than setting the encoding properly and reading everything in as Unicode characters instead of icky locale bytes.

Unicode provides for much more robust character handling than do POSIX locales, whether this is for case mapping, collating, or really anything else having to do with characters.
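As a minimal sketch of what "setting the encoding properly" can look like, the snippet below decodes bytes into characters at the input boundary, either with an I/O layer or with an explicit `decode`, and then applies Unicode case mapping. The sample word and the in-memory filehandle are my own illustrative choices, standing in for a real file.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: the raw UTF-8 bytes of "straße\n", as they might sit on disk.
my $bytes = "stra\xC3\x9Fe\n";

# Option 1: declare the encoding on the handle, so reads return characters.
# An in-memory handle stands in for a real file here.
open(my $fh, '<:encoding(UTF-8)', \$bytes) or die "open failed: $!";
my $line = <$fh>;
close $fh;
chomp $line;

# Option 2: decode explicitly at the boundary (LEAVE_SRC keeps $bytes intact).
my $decoded = decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
chomp $decoded;

# Encode output explicitly as well, instead of relying on locale settings.
binmode(STDOUT, ':encoding(UTF-8)');
printf "%d bytes became %d characters\n", length($bytes) - 1, length($line);
print uc($line), "\n";    # Unicode case mapping, no locale involved
```

Either way, everything after the boundary works on characters, so `length`, `uc`, and the regex engine follow Unicode rules without `use locale` anywhere in the program.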

Here is a relevant excerpt from Perl 5.14’s perllocale manpage, with underlining mine:

Locales these days have mostly been supplanted by Unicode, but Perl continues to support them.

The support of Unicode is new starting from Perl version 5.6, and more fully implemented in version 5.8, and later. See the perluniintro manpage. Perl tries to work with both Unicode and locales. But, of course, there are problems.

Perl does not handle multi-byte locales, such as have been used for various Asian languages, such as Big5 or Shift JIS. However, the multi-byte, increasingly common, UTF-8 locales, if properly implemented, tend to work reasonably well in Perl, simply because both they and Perl store the characters that take up multiple bytes the same way.

Perl generally takes the tack to use locale rules on code points that can fit in a single byte, and Unicode rules for those that can’t (though this wasn’t uniformly applied prior to Perl 5.14). This prevents many problems in locales that aren’t UTF-8. Suppose the locale is ISO8859-7, Greek. The character at 0xD7 there is a capital Chi. But in the ISO8859-1 locale, Latin1, it is a multiplication sign. The POSIX regular expression character class [[:alpha:]] will magically match 0xD7 in the Greek locale, but not in the Latin, even if the string is encoded in UTF-8, which normally would imply Unicode. (The “U” in UTF-8 stands for Unicode.)

However, there are places where this breaks down. Certain constructs are for Unicode only, such as \p{Alpha}. They assume that 0xD7 always has the Unicode meaning (or its equivalent on EBCDIC platforms). Since Latin1 is a subset of Unicode, 0xD7 is the multiplication sign in Unicode, so \p{Alpha} will not match it, regardless of locale. A similar issue happens with \N{...}. Therefore, it is a bad idea to use \p{} or \N{} under locale unless you know that the locale is always going to be ISO8859-1 or a UTF-8 one. Use the POSIX character classes instead.

The same problem ensues if you enable automatic UTF-8-ification of your standard file handles, default open() layer, and @ARGV on non-ISO8859-1, non-UTF-8 locales (by using either the -C command line switch or the PERL_UNICODE environment variable; see the perlrun manpage for the documentation of the -C switch). Things are read in as UTF-8 which would normally imply a Unicode interpretation, but the presence of locale causes them to be interpreted in that locale, so a 0xD7 code point in the input will have meant the multiplication sign, but won’t be interpreted by Perl that way in the Greek locale. Again, this is not a problem if you know that the locales are always going to be ISO8859-1 or UTF-8.

Vendor locales are notoriously buggy, and it is difficult for Perl to test its locale handling code because it interacts with code that Perl has no control over; therefore the locale handling code in Perl may be buggy as well. But if you do have locales that work, it may be worthwhile using them, keeping in mind the gotchas already mentioned. Locale collation is faster than Unicode::Collate, for example, and you gain access to things such as the currency symbol and days of the week.

BUGS

Broken systems

In certain systems, the operating system’s locale support is broken and cannot be fixed or used by Perl. Such deficiencies can and will result in mysterious hangs and/or Perl core dumps when the use locale is in effect. When confronted with such a system, please report in excruciating detail to <perlbug@perl.org>, and complain to your vendor: bug fixes may exist for these problems in your operating system. Sometimes such bug fixes are called an operating system upgrade.
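To make the 0xD7 example from the excerpt concrete without depending on which locales happen to be installed, here is a small sketch that uses the Encode module rather than `use locale`. It decodes the same byte under ISO-8859-7 and ISO-8859-1 and then asks the Unicode-only `\p{Alpha}` about each result. The variable names are mine, not the manpage's.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $byte = "\xD7";

# The same byte means different characters under different legacy encodings.
my $greek  = decode('ISO-8859-7', $byte);   # U+03A7 GREEK CAPITAL LETTER CHI
my $latin1 = decode('ISO-8859-1', $byte);   # U+00D7 MULTIPLICATION SIGN

for my $pair ([Greek => $greek], [Latin1 => $latin1]) {
    my ($name, $char) = @$pair;
    # \p{Alpha} always applies Unicode rules to the code point it is given.
    printf "%-6s U+%04X  \\p{Alpha}: %s\n",
        $name, ord($char), ($char =~ /\p{Alpha}/ ? 'yes' : 'no');
}
```

Once the byte has been decoded with the right encoding, the Unicode property gives the expected answer in both cases, which is the point of decoding at the boundary instead of leaning on the locale.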

My personal advice is to strongly avoid vendor locales. It’s not a legacy you want to see propagated.
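For collation in particular, one locale-free alternative is the Unicode::Collate module that the excerpt mentions. The following is only a sketch with a made-up word list, using the module's default collation settings rather than any language-specific tailoring.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Hypothetical sample data; "\x{E9}" is e with acute accent.
my @words = ("zebra", "\x{E9}clair", "Apple", "apple", "eclair");

# Plain sort compares code points, so the accented word lands after "zebra".
my @by_codepoint = sort @words;

# Unicode::Collate applies the Unicode Collation Algorithm instead.
my @by_uca = Unicode::Collate->new->sort(@words);

binmode(STDOUT, ':encoding(UTF-8)');
print "code point order: @by_codepoint\n";
print "UCA order:        @by_uca\n";
```

As the excerpt notes, locale collation can be faster, so this trades some speed for results that do not depend on the vendor's locale data.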


In reply to Re^3: Locale and Unicode, enemies in perl? by tchrist
in thread Locale and Unicode, enemies in perl? by andal
