After reading perldoc perlunicode, it seems that there is some conflict in Perl between support for locales and support for Unicode. At the very least, "use locale" breaks certain Unicode features that work without it. This puzzles me. On general principles, nothing like that should happen. Of course, my "general considerations" may simply be wrong, so I've decided to ask other Perl developers for their opinion.

As far as I understand, Unicode defines almost everything necessary for handling characters. At least, Perl's Unicode support provides lookup of various character properties ("\p{Uppercase}" etc.). I believe this is mostly enough for text matching and case conversion. Unicode also provides collation charts, but I don't know whether they are supported in Perl. Anyway, the point is that Perl is pretty smart about handling characters once they have been identified.
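To make the point about properties and case conversion concrete, here is a minimal sketch. The sample string is just something I picked for illustration, and it assumes the source file is saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;    # the source below contains a UTF-8 literal

# Hypothetical sample string, chosen only for illustration.
my $word = "Wär";

# Property lookup: count uppercase and lowercase characters.
my $upper = () = $word =~ /\p{Uppercase}/g;
my $lower = () = $word =~ /\p{Lowercase}/g;

# Case conversion uses the Unicode case mappings for this character string.
my $shouted = uc $word;

# Encode the output explicitly so that no locale setting is involved.
binmode STDOUT, ':encoding(UTF-8)';
print "$upper upper, $lower lower, uppercased: $shouted\n";
```

Note that "use locale" does not appear anywhere here; the properties come from the Unicode tables alone.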

Where does the conflict with locales come from? Again, as far as I understand, a locale defines a set of rules that are common to the environment. These rules include the collation order for sorting, the character encoding, the language of messages, and so on. All of this is advisory, so it shouldn't come into conflict with anything. Why does it conflict with how Perl operates?

In general, I would expect locale settings to be the source of defaults for Perl. For example, in the absence of "use utf8", Perl should assume that the source file is encoded in the character set defined by the locale. Likewise, in the absence of an explicit "binmode" on a file handle, Perl should assume that the input is encoded in the character set defined by the locale. This would help Perl convert octets into Unicode characters. Once this conversion is done, the locale setting is no longer needed. This means that string matching should not care about the locale, unless it is given octets instead of characters to match.
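Here is a rough sketch of what I mean by "the locale only supplies the default encoding". The helper names locale_charset and first_word are made up for this post, and the fallback to Latin-1 is my own assumption; I18N::Langinfo, POSIX and Encode are the standard modules. The match itself never looks at the locale, only at decoded characters.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);
use I18N::Langinfo qw(langinfo CODESET);
use POSIX qw(setlocale LC_CTYPE);

# Hypothetical helper: name the character set of the current locale,
# falling back to Latin-1 (an assumption) when Encode does not know it.
sub locale_charset {
    setlocale(LC_CTYPE, '');    # adopt LC_CTYPE from the environment
    my $name = langinfo(CODESET);
    return $name if defined $name && length $name && find_encoding($name);
    return 'iso-8859-1';
}

# Hypothetical helper: decode octets first, then match on characters only.
sub first_word {
    my ($charset, $octets) = @_;
    my $text = decode($charset, $octets);
    return $text =~ /(\w+)/ ? $1 : undef;
}

my $utf8_octets   = "w\xC3\xA4r war";    # "wär war" encoded as UTF-8
my $latin1_octets = "w\xE4r war";        # "wär war" encoded as Latin-1

my $from_utf8   = first_word('UTF-8',      $utf8_octets);
my $from_latin1 = first_word('iso-8859-1', $latin1_octets);

print "locale charset: ", locale_charset(), "\n";
print "same word from both encodings\n" if $from_utf8 eq $from_latin1;
```

In real code the charset passed to first_word would come from locale_charset() (or from the ":locale" layer of the "open" pragma); the two fixed charsets are used here only so that the result does not depend on the machine it runs on.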

In short, locale support should be just an extra level in providing defaults. If "use locale" is not present, the default encoding for octets is Latin-1. If "use locale" is present, the default encoding would be whatever the locale defines.

If it were done this way, then code like

    use utf8;
    use locale;
    my $tst = "wär war";
    die "No match\n" unless $tst =~ /(\w+)/;
    print $1, "\n";

would produce the correct output "wär" and not "w". Moreover, the -C switch would not be required to run this code.

Am I missing something in my understanding?


In reply to Locale and Unicode, enemies in perl? by andal
