After reading perldoc perlunicode, it seems that there is some conflict in Perl between support for locales and support for Unicode. At least, "use locale" breaks certain Unicode features that work without it. This puzzled me. From general considerations, there should be no such conflict. Of course, my "general considerations" may simply be wrong, so I decided to ask other Perl developers for their opinion.
As far as I understand, Unicode defines almost everything necessary for handling characters. At least, Perl's Unicode support provides lookup of various character properties ("\p{Uppercase}" etc.). I believe this is mostly enough for text matching and case conversion. Unicode also provides collation charts, but I don't know whether they are supported in Perl. Anyway, the point is that Perl is pretty smart about handling characters once they are identified.
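To make the above concrete, here is a minimal sketch of what I mean by property lookup and case conversion on character strings. It is only an illustration: the sample words and variable names are made up, and it assumes the core module Unicode::Collate is installed, which, as far as I can tell, covers the collation part.

```perl
use strict;
use warnings;
use utf8;                              # the literals below are UTF-8 in the source
use Unicode::Collate;                  # assumed available (core module)

binmode(STDOUT, ':encoding(UTF-8)');   # encode characters on output

my $word = "Äpfel";                    # hypothetical sample word

# Property lookup: does the word start with an uppercase character?
my $starts_upper = $word =~ /^\p{Uppercase}/ ? 1 : 0;

# Case conversion works on characters, not octets.
my $upper = uc $word;
my $lower = lc $word;

# Plain sort compares code points; Unicode::Collate uses collation weights.
my @by_codepoint = sort ("zebra", "Äpfel");
my @collated     = Unicode::Collate->new->sort("zebra", "Äpfel");

print "starts upper: $starts_upper\n";
print "upper: $upper, lower: $lower\n";
print "by code point: @by_codepoint\n";
print "collated:      @collated\n";
```

None of this involves "use locale"; the string only has to be a character string to begin with.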
Where does the conflict with locales come from? Again, as far as I understand, a locale defines a set of rules that are common for the environment. These rules include the collation order for sorting, the character encoding, the language of messages, etc. All of this is advisory, so it shouldn't come into conflict with anything. Why does it conflict with Perl's operation?
In general, I would expect locale settings to be the source of defaults for Perl. For example, in the absence of "use utf8", Perl should assume that the source file is encoded in the character set defined by the locale. Likewise, in the absence of an explicit "binmode" on a file handle, Perl should assume that the input is encoded in the character set defined by the locale. This would help Perl convert octets into Unicode characters. Once this conversion is done, the locale setting is not needed any more. This means that string matching should not care about the locale, unless it is given octets instead of characters to match.
In short, locale support should be just an extra level in providing defaults. If "use locale" is not present, the default encoding for octets is Latin1. In the presence of "use locale", the default encoding would be whatever the locale defines.
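Roughly, the behaviour I have in mind could be sketched by hand like this. This is not how Perl works today, only an illustration of "locale as a source of defaults". The helper names locale_codeset and decode_with_default are hypothetical, and I assume that POSIX, I18N::Langinfo and Encode are available on the platform.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);

# Hypothetical helper: ask the environment's locale for its character set.
# Returns undef if the platform cannot tell us.
sub locale_codeset {
    setlocale(LC_CTYPE, '');                  # adopt the environment's locale
    my $name = eval { langinfo(CODESET()) };
    return (defined $name && length $name) ? $name : undef;
}

# Hypothetical helper: decode octets with the given character set,
# falling back to Latin-1 when the name is missing or unknown to Encode.
sub decode_with_default {
    my ($octets, $codeset) = @_;
    my $enc = (defined $codeset ? find_encoding($codeset) : undef)
           || find_encoding('iso-8859-1');
    return $enc->decode($octets);
}

# The same three characters, arriving as octets in two different encodings.
my $from_utf8   = decode_with_default("w\xC3\xA4r", 'UTF-8');
my $from_latin1 = decode_with_default("w\xE4r", undef);

my $codeset = locale_codeset();
print "locale codeset: ", (defined $codeset ? $codeset : 'unknown'), "\n";
print "lengths: ", length($from_utf8), " and ", length($from_latin1), "\n";
```

In real use, the result of locale_codeset() would be passed as the second argument. After decoding, both strings hold the same characters, and nothing later in the program needs to know which locale the octets came from.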
If it were done this way, then code like

    use utf8;
    use locale;
    my $tst = "wär war";
    die "No match\n" unless $tst =~ /(\w+)/;
    print $1, "\n";

would produce the correct output "wär" and not "w". More than that, the -C switch would not be required for running this code.
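For comparison, here is a small sketch of the same match without "use locale", showing the difference between characters and octets. The variable names are mine, the byte string stands in for Latin1 input, and the /u modifier assumes Perl 5.14 or later.

```perl
use strict;
use warnings;
use utf8;                                # "ä" below is one character, not two octets

# Character string: \w follows Unicode rules, so "ä" is a word character.
my $chars = "wär war";
my ($from_chars) = $chars =~ /(\w+)/;

# Octet string (Latin1 "ä" is 0xE4): under the default rules \w is
# ASCII-only here, so the match stops after the "w".
my $octets = "w\xE4r war";
my ($from_octets) = $octets =~ /(\w+)/;

# Forcing Unicode rules with /u treats the same octets as Latin1 characters.
my ($from_octets_u) = $octets =~ /(\w+)/u;

printf "%d %d %d\n",
    length $from_chars, length $from_octets, length $from_octets_u;
# prints "3 1 3"
```

So the matching itself is fine once Perl knows it is dealing with characters; the only open question is which encoding is assumed for octets.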
Am I missing something in my understanding?
In reply to Locale and Unicode, enemies in perl? by andal