in reply to Re: use locale behavior depends on charset of locale?
in thread use locale behavior depends on charset of locale?

Oops, I didn't mean to post so soon since I haven't figured out what's going on yet. I wanted to try this on 5.10 before doing anything else since this is obviously a bug. (The behaviour of \w shouldn't change based on whether it's in a char class or not.)

Well, at least it provides more details about the problem and clear instructions on how to reproduce it. Could someone run this on Perl 5.10 for me? I don't have a Unix box with that version handy.

Re^3: use locale behavior depends on charset of locale?
by ig (Vicar) on Jul 10, 2009 at 17:04 UTC

    Here are the results from a default build of perl 5.10.0 on CentOS 5.3:

    [ian@alula ~]$ perl -e'binmode STDOUT, ":encoding(UTF-8)"; print chr 0xc9' | LANG=en_CA.utf8 perl test.pl 0
    unix:perlio:encoding(utf-8-strict):utf8
    Input = "\x{c9}" [utf8]
    Outside char class: match
    Inside char class: match
    [ian@alula ~]$ perl -e'binmode STDOUT, ":encoding(UTF-8)"; print chr 0xc9' | LANG=en_CA.utf8 perl test.pl 1
    unix:perlio:encoding(utf-8-strict):utf8
    Input = "\x{c9}" [utf8]
    Outside char class: no match
    Inside char class: match

Re^3: use locale behavior depends on charset of locale?
by zwon (Abbot) on Jul 10, 2009 at 18:02 UTC
    The behaviour of \w shouldn't change based on whether it's in a char class or not.

    But running with -Dr shows that these regexes are compiled differently. And it looks like the problem is not in IO; I've tried the following:

    #use locale;
    use utf8;
    $a = 'é';
    $a =~ m/\w/ and print "first re match\n";
    $a =~ m/[\w]/ and print "second re match\n";

    If use locale is commented out, it outputs:

    first re match
    second re match

    If use locale is uncommented, it outputs only:

    second re match

    My locale is en_US.UTF-8, BTW.

    debugperl -Dr shows the following:

    Without locale:

    Compiling REx "\w"
    Final program:
       1: ALNUM (2)
       2: END (0)
    stclass ALNUM minlen 1
    Compiling REx "[\w]"
    Final program:
       1: ANYOF[0-9A-Z_a-z+utf8::IsWord] (13)
      13: END (0)
    stclass ANYOF[0-9A-Z_a-z+utf8::IsWord] minlen 1

    With locale:

    Compiling REx "\w"
    Final program:
       1: ALNUML (2)
       2: END (0)
    stclass ALNUML minlen 1
    Compiling REx "[\w]"
    Final program:
       1: ANYOF{loc}[\w+utf8::IsWord] (13)
      13: END (0)
    stclass ANYOF{loc}[\w+utf8::IsWord] minlen 1

    I don't really understand this output very well, but it looks like it treats these regexes differently.
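
For anyone without a debugging build of perl, a similar dump of the compiled program can usually be obtained with the core re pragma. The following is only a minimal sketch: the variable names $plain and $class are made up for illustration, the exact node names printed differ between perl versions, and the dump goes to STDERR.

```perl
use strict;
use warnings;
use locale;    # comment this line out to compare the two compilations

our ($plain, $class);    # illustrative names, not from the original post

{
    # 'use re "debug"' is lexically scoped on perl 5.10 and later, so
    # only the patterns compiled inside this block are dumped to STDERR.
    use re 'debug';
    $plain = qr/\w/;      # \w outside a character class
    $class = qr/[\w]/;    # \w inside a character class
}

print "patterns compiled\n";
```

Running it once with use locale and once without should show whether the two patterns compile to different nodes on a given perl.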

      And it looks like the problem is not in IO

      Correct. That's what my changes show.

      I don't really understand this output very well, but it looks like it treats these regexes differently.

      It's not a problem that /\w/ and /[\w]/ compile differently. It's a problem that they don't compile to something equivalent.

      use locale;
      utf8::upgrade( my $s = chr(0xC9) ); # e-acute
      print "Outside char class: ", $s =~ m/\w/ ? "" : "no ", "match\n";
      print "Inside char class: ", $s =~ m/[\w]/ ? "" : "no ", "match\n";

      LANG=en_CA.utf8 perl test.pl
      Outside char class: no match
      Inside char class: match
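
To check more than one character at a time, the test above can be wrapped in a small loop. This is a rough sketch, not a definitive test: word_match_pair is a hypothetical helper name, the code points are arbitrary picks, and what gets reported for U+00C9 depends on the perl version and on the locale in effect when the script runs.

```perl
use strict;
use warnings;
use locale;

# Hypothetical helper: returns (outside, inside) as 1/0 flags for
# /\w/ and /[\w]/ applied to the same string.
sub word_match_pair {
    my ($s) = @_;
    my $outside = $s =~ m/\w/   ? 1 : 0;
    my $inside  = $s =~ m/[\w]/ ? 1 : 0;
    return ( $outside, $inside );
}

for my $cp ( 0x41, 0xC9, 0x20 ) {    # 'A', e-acute, space
    my $s = chr $cp;
    utf8::upgrade($s);               # force the internal UTF-8 representation
    my ( $outside, $inside ) = word_match_pair($s);
    printf "U+%04X outside=%d inside=%d %s\n",
        $cp, $outside, $inside,
        $outside == $inside ? "consistent" : "INCONSISTENT";
}
```

Any line flagged INCONSISTENT is a character for which /\w/ and /[\w]/ disagree, which is the symptom described in this thread.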