Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

hi, i'm confused about locales and unicode. i've already tried asking on #perl but that only made the problem clearer, not the solution. the problem is the following
# LC_ALL=hu_HU (the encoding for the locale is latin2)
use open qw/:locale/;
use locale;
$a = <>;          # read in eg: 'é' in latin2, converted correctly
$a =~ m/\w/;
$a =~ m/[\w]/;    # both match

however, with the same code and LC_ALL=hu_HU.UTF-8 (receiving a utf8-encoded 'é'), the first match fails and the second succeeds. if i remove the use locale in that second case, both match. why is this?

Replies are listed 'Best First'.
Re: use locale behavior depends on charset of locale?
by ikegami (Patriarch) on Jul 10, 2009 at 15:58 UTC

    A note before starting: IIRC, use open doesn't work too well with ARGV (<> is short for <ARGV>), but it works here since you end up reading from STDIN (not a file named on the command line).

    I'm having problems getting any locale except en_CA.utf8 working, so I'll use that one. First, I changed your program somewhat:

    use open qw/:locale/;

    BEGIN {
        require locale;
        locale->import() if shift;
    }

    use Data::Dumper qw( Dumper );

    sub dump_str {
        my ($s) = @_;
        my $internal_enc = utf8::is_utf8($s) ? "utf8" : "iso-latin-1";
        local $Data::Dumper::Useqq = 1;
        local $Data::Dumper::Terse = 1;
        local $Data::Dumper::Indent = 0;
        print("Input = ", Dumper($s), " [$internal_enc]\n");
    }

    print(join(':', PerlIO::get_layers(STDIN)), "\n");

    my $s = <>;
    dump_str($s);
    print "Outside char class: ", $s =~ m/\w/ ? "" : "no ", "match\n";
    print "Inside char class: ", $s =~ m/[\w]/ ? "" : "no ", "match\n";

    $ perl -e'binmode STDOUT, ":encoding(UTF-8)"; print chr 0xC9' | LANG=en_CA.utf8 perl a.pl 0
    unix:perlio:utf8
    Input = "\x{c9}" [utf8]
    Outside char class: match
    Inside char class: match

    $ perl -e'binmode STDOUT, ":encoding(UTF-8)"; print chr 0xC9' | LANG=en_CA.utf8 perl a.pl 1
    unix:perlio:utf8
    Input = "\x{c9}" [utf8]
    Outside char class: no match
    Inside char class: match

    (5.8.8 on Debian)

    The important addition is the display of the internal encoding of the input. Match operations base some of their behaviour on the internal encoding of the string being matched.
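
To illustrate the point about internal encoding, here is a minimal sketch that is not from the thread. It assumes no use locale, no use feature 'unicode_strings', and no /u or /a regex flag; under those assumptions the same character should fail \w while stored as a single byte and match once it is upgraded to the UTF-8 internal form. The variable names are only for illustration.

```perl
use strict;
use warnings;

# Assumption: no "use locale", no "use feature 'unicode_strings'", no /u or /a.
my $s = chr(0xE9);                 # LATIN SMALL LETTER E WITH ACUTE

utf8::downgrade($s);               # internal form: one byte per character
my $bytes_match = ($s =~ /\w/) ? 1 : 0;

utf8::upgrade($s);                 # same character, now stored as UTF-8 internally
my $utf8_match = ($s =~ /\w/) ? 1 : 0;

printf "downgraded: %s\n", $bytes_match ? "match" : "no match";
printf "upgraded:   %s\n", $utf8_match  ? "match" : "no match";
```

This is why dump_str reports utf8::is_utf8($s) next to the input: two strings that compare equal can still be matched under different rules.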

      Oops, I didn't mean to post so soon since I haven't figured out what's going on yet. I wanted to try this on 5.10 before doing anything else since this is obviously a bug. (The behaviour of \w shouldn't change based on whether it's in a char class or not.)

      Well, at least it provides more details about the problem and clear instructions on how to reproduce it. Could someone run this on Perl 5.10 for me? I don't have a unix box with that version handy.

        Here are results from a default build of perl 5.10.0 on CentOS 5.3

        [ian@alula ~]$ perl -e'binmode STDOUT, ":encoding(UTF-8)"; print chr 0xc9' | LANG=en_CA.utf8 perl test.pl 0
        unix:perlio:encoding(utf-8-strict):utf8
        Input = "\x{c9}" [utf8]
        Outside char class: match
        Inside char class: match

        [ian@alula ~]$ perl -e'binmode STDOUT, ":encoding(UTF-8)"; print chr 0xc9' | LANG=en_CA.utf8 perl test.pl 1
        unix:perlio:encoding(utf-8-strict):utf8
        Input = "\x{c9}" [utf8]
        Outside char class: no match
        Inside char class: match

        "The behaviour of \w shouldn't change based on whether it's in a char class or not."

        But running with -Dr shows that these regexes compiled differently. And it looks like the problem is not in IO, I've tried the following:

        #use locale;
        use utf8;
        $a = 'é';
        $a =~ m/\w/ and print "first re match\n";
        $a =~ m/[\w]/ and print "second re match\n";

        if use locale is commented out it outputs:
        first re match
        second re match

        If use locale is uncommented, it outputs only:

        second re match

        My locale is en_US.UTF-8, BTW.

        debugperl -Dr shows the following:

        Without locale:

        Compiling REx "\w"
        Final program:
           1: ALNUM (2)
           2: END (0)
        stclass ALNUM minlen 1
        Compiling REx "[\w]"
        Final program:
           1: ANYOF[0-9A-Z_a-z+utf8::IsWord] (13)
          13: END (0)
        stclass ANYOF[0-9A-Z_a-z+utf8::IsWord] minlen 1

        With locale:

        Compiling REx "\w"
        Final program:
           1: ALNUML (2)
           2: END (0)
        stclass ALNUML minlen 1
        Compiling REx "[\w]"
        Final program:
           1: ANYOF{loc}[\w+utf8::IsWord] (13)
          13: END (0)
        stclass ANYOF{loc}[\w+utf8::IsWord] minlen 1

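        I don't really understand this output very well, but it looks like Perl compiles these two regexes differently: the bare \w becomes a locale-only node (ALNUML), while [\w] becomes a char class that still carries utf8::IsWord.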
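
For anyone without a debugging build of perl, a similar compile dump can probably be obtained with the re pragma. This is only a sketch and not what was run above: it assumes use re 'debug' is available in your perl, the dump goes to STDERR, and the node names will differ between perl versions (newer perls do not print ALNUM/ALNUML). The @compiled array is a hypothetical helper that just keeps the compiled patterns around.

```perl
use strict;
use warnings;

my @compiled;    # hypothetical helper: holds the compiled regexes

{
    use re 'debug';          # lexically scoped; compile-time dump goes to STDERR
    push @compiled, qr/\w/, qr/[\w]/;
}

{
    use locale;              # same two patterns, now compiled under locale rules
    use re 'debug';
    push @compiled, qr/\w/, qr/[\w]/;
}

print scalar(@compiled), " patterns compiled\n";
```

Comparing the two halves of the dump should show whether your perl still compiles \w and [\w] to different kinds of nodes under use locale.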

Re: use locale behavior depends on charset of locale?
by Anonymous Monk on Jul 10, 2009 at 13:39 UTC
    Check
    print join( ':', PerlIO::get_layers(STDIN)), "\n";