in reply to use locale behavior depends on charset of locale?

A note before starting: IIRC, use open doesn't work too well with ARGV (<> is short for <ARGV>), but it works here since you end up reading from STDIN (not a file named on the command line).

I'm having problems getting any locale except en_CA.utf8 working, so I'll use that one. First, I changed your program somewhat:

use open qw/:locale/;

BEGIN {
    require locale;
    locale->import() if shift;
}

use Data::Dumper qw( Dumper );

sub dump_str {
    my ($s) = @_;
    my $internal_enc = utf8::is_utf8($s) ? "utf8" : "iso-latin-1";
    local $Data::Dumper::Useqq  = 1;
    local $Data::Dumper::Terse  = 1;
    local $Data::Dumper::Indent = 0;
    print("Input = ", Dumper($s), " [$internal_enc]\n");
}

print(join(':', PerlIO::get_layers(STDIN)), "\n");

my $s = <>;
dump_str($s);

print "Outside char class: ", $s =~ m/\w/   ? "" : "no ", "match\n";
print "Inside char class: ",  $s =~ m/[\w]/ ? "" : "no ", "match\n";

$ perl -e'binmode STDOUT, ":encoding(UTF-8)"; print chr 0xC9' | LANG=en_CA.utf8 perl a.pl 0
unix:perlio:utf8
Input = "\x{c9}" [utf8]
Outside char class: match
Inside char class: match

$ perl -e'binmode STDOUT, ":encoding(UTF-8)"; print chr 0xC9' | LANG=en_CA.utf8 perl a.pl 1
unix:perlio:utf8
Input = "\x{c9}" [utf8]
Outside char class: no match
Inside char class: match

(5.8.8 on Debian)

The important addition is the display of the internal encoding of the input (whether the string's UTF8 flag is set). Match operations base some of their behaviour on the internal encoding of the string being matched.
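To illustrate the point about internal encoding on its own, without locale or IO involved, here is a minimal sketch. The helper name describe is made up for this example. It builds two copies of the same one-character string that differ only in their UTF8 flag and reports how \w treats each. What it prints depends on the Perl version and on whether something like the unicode_strings feature or the /u or /a modifiers is in effect, so treat it as a probe rather than a statement of what Perl must do.

```perl
use strict;
use warnings;

# Two copies of the same one-character string (U+00C9, E-acute).
my $native   = chr(0xC9);     # stored internally as a single byte
my $upgraded = $native;
utf8::upgrade($upgraded);     # same character, now stored internally as UTF-8

# Hypothetical helper: report one string's UTF8 flag and its \w result.
sub describe {
    my ($label, $s) = @_;
    return sprintf "%-8s is_utf8=%d  \\w: %s",
        $label,
        utf8::is_utf8($s) ? 1 : 0,
        $s =~ /\w/ ? "match" : "no match";
}

# The two strings compare equal; only their internal storage differs.
print "same string: ", ($native eq $upgraded ? "yes" : "no"), "\n";
print describe("native",   $native),   "\n";
print describe("upgraded", $upgraded), "\n";
```

If the two \w lines disagree, the regex engine is choosing its rules from the internal encoding, which is the effect dump_str was added to expose.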

Re^2: use locale behavior depends on charset of locale?
by ikegami (Patriarch) on Jul 10, 2009 at 16:02 UTC
    Oops, I didn't mean to post so soon since I haven't figured out what's going on yet. I wanted to try this on 5.10 before doing anything else since this is obviously a bug. (The behaviour of \w shouldn't change based on whether it's in a char class or not.)

    Well, at least it provides more details about the problem and clear instructions on how to reproduce it. Could someone run this on Perl 5.10 for me? I don't have a unix box with that version handy.

      Here are results from a default build of perl 5.10.0 on CentOS 5.3

      [ian@alula ~]$ perl -e'binmode STDOUT, ":encoding(UTF-8)"; print chr 0xc9' | LANG=en_CA.utf8 perl test.pl 0
      unix:perlio:encoding(utf-8-strict):utf8
      Input = "\x{c9}" [utf8]
      Outside char class: match
      Inside char class: match

      [ian@alula ~]$ perl -e'binmode STDOUT, ":encoding(UTF-8)"; print chr 0xc9' | LANG=en_CA.utf8 perl test.pl 1
      unix:perlio:encoding(utf-8-strict):utf8
      Input = "\x{c9}" [utf8]
      Outside char class: no match
      Inside char class: match
      The behaviour of \w shouldn't change based on whether it's in a char class or not.

      But running with -Dr shows that these regexes are compiled differently. And it looks like the problem is not in IO. I tried the following:

      #use locale;
      use utf8;
      $a = 'é';
      $a =~ m/\w/   and print "first re match\n";
      $a =~ m/[\w]/ and print "second re match\n";
      With use locale commented out, it outputs:
      first re match
      second re match
      With use locale uncommented, it outputs only:
      second re match
      My locale is en_US.UTF-8, BTW.

      debugperl -Dr shows the following:

      Without locale:
      Compiling REx "\w"
      Final program:
         1: ALNUM (2)
         2: END (0)
      stclass ALNUM minlen 1
      Compiling REx "[\w]"
      Final program:
         1: ANYOF[0-9A-Z_a-z+utf8::IsWord] (13)
        13: END (0)
      stclass ANYOF[0-9A-Z_a-z+utf8::IsWord] minlen 1

      With locale:
      Compiling REx "\w"
      Final program:
         1: ALNUML (2)
         2: END (0)
      stclass ALNUML minlen 1
      Compiling REx "[\w]"
      Final program:
         1: ANYOF{loc}[\w+utf8::IsWord] (13)
        13: END (0)
      stclass ANYOF{loc}[\w+utf8::IsWord] minlen 1

      I don't really understand this output very well, but it looks like it treats these regexes differently.

        And it looks like the problem is not in IO

        Correct. That's what my changes show.

        I don't really understand this output very well, but it looks like it treats these regexes differently.

        It's not a problem that /\w/ and /[\w]/ compile differently. It's a problem that they don't compile to something equivalent.

        use locale;
        utf8::upgrade( my $s = chr(0xC9) );   # e-acute
        print "Outside char class: ", $s =~ m/\w/   ? "" : "no ", "match\n";
        print "Inside char class: ",  $s =~ m/[\w]/ ? "" : "no ", "match\n";

        $ LANG=en_CA.utf8 perl test.pl
        Outside char class: no match
        Inside char class: match
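
The same check can be widened from one character to every code point below 256. This is only a sketch under use locale; the helper name class_mismatches is made up for this example. It lists the code points for which /\w/ and /[\w]/ disagree on an upgraded string, so any output other than "no mismatches" means the two forms are not equivalent on that build. Results depend on the Perl version and on the locale in the environment.

```perl
use strict;
use warnings;
use locale;    # same pragma as the test case above

# Hypothetical helper: return the code points in 0..255 for which
# /\w/ and /[\w]/ give different answers on an upgraded string.
sub class_mismatches {
    my @bad;
    for my $cp (0 .. 0xFF) {
        my $s = chr($cp);
        utf8::upgrade($s);    # force the UTF-8 internal representation
        my $outside = $s =~ /\w/   ? 1 : 0;
        my $inside  = $s =~ /[\w]/ ? 1 : 0;
        push @bad, $cp if $outside != $inside;
    }
    return @bad;
}

my @bad = class_mismatches();
if (@bad) {
    printf "%d mismatches: %s\n",
        scalar @bad, join ' ', map { sprintf 'U+%04X', $_ } @bad;
}
else {
    print "no mismatches\n";
}
```

Run it the same way as the test above, for example with LANG=en_CA.utf8 set, to compare builds.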