Re: use locale behavior depends on charset of locale?

A note before starting: IIRC, `use open` doesn't work too well with ARGV (`<>` is short for `<ARGV>`), but it works here since you end up reading from STDIN (not a file named on the command line).

I'm having problems getting any locale except en_CA.utf8 working, so I'll use that one. First, I changed your program somewhat:

use open qw/:locale/;

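BEGIN {
    # Enable "use locale" only if the first command-line argument is true,
    # so the same script can be run both ways (perl a.pl 0 / perl a.pl 1).
    require locale;
    locale->import() if shift;
}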

use Data::Dumper qw( Dumper );

sub dump_str {
    my ($s) = @_;
    my $internal_enc = utf8::is_utf8($s) ? "utf8" : "iso-latin-1";
    local $Data::Dumper::Useqq  = 1;
    local $Data::Dumper::Terse  = 1;
    local $Data::Dumper::Indent = 0;
    print("Input = ", Dumper($s), " [$internal_enc]\n");
}

print(join(':', PerlIO::get_layers(STDIN)), "\n");

my $s = <>;
dump_str($s);

print "Outside char class: ", $s =~ m/\w/   ? "" : "no ", "match\n";
print "Inside char class:  ", $s =~ m/[\w]/ ? "" : "no ", "match\n";

$ perl -e'binmode STDOUT, ":encoding(UTF-8)"; print chr 0xC9' | LANG=en_CA.utf8 perl a.pl 0
unix:perlio:utf8
Input = "\x{c9}" [utf8]
Outside char class: match
Inside char class:  match

$ perl -e'binmode STDOUT, ":encoding(UTF-8)"; print chr 0xC9' | LANG=en_CA.utf8 perl a.pl 1
unix:perlio:utf8
Input = "\x{c9}" [utf8]
Outside char class: no match
Inside char class:  match

(5.8.8 on Debian)

The important addition is the display of the internal encoding of the input. Match operations base some of their behaviour on the internal encoding of the string being matched (that is, on whether its UTF8 flag is on).
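To illustrate that last point in isolation, here is a minimal sketch that takes locales and I/O layers out of the picture entirely. It assumes a script with neither `use locale` nor the `unicode_strings` feature in effect; the names `$bytes`, `$chars` and `report` are mine, not from the original program. It builds the same one-character string twice, forces a different internal storage on each copy, and runs the same `/\w/` match against both.

```perl
use strict;
use warnings;

# Two copies of the same one-character string (U+00C9), differing only in
# how Perl stores them internally.
my $bytes = chr(0xC9);
my $chars = chr(0xC9);
utf8::downgrade($bytes);   # stored as a single byte, UTF8 flag off
utf8::upgrade($chars);     # stored as UTF-8, UTF8 flag on

# Hypothetical helper: describe the internal encoding and the \w result.
sub report {
    my ($label, $s) = @_;
    my $enc = utf8::is_utf8($s) ? "utf8" : "iso-latin-1";
    my $res = $s =~ /\w/ ? "match" : "no match";
    return "$label [$enc]: $res";
}

print report("downgraded", $bytes), "\n";
print report("upgraded",   $chars), "\n";
print "Strings compare equal: ", ($bytes eq $chars ? "yes" : "no"), "\n";
```

The two strings compare equal, yet under Perl's default matching rules only the upgraded copy should be treated as a word character. That is why the dump of the internal encoding matters when reading the results above.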


Re^2: use locale behavior depends on charset of locale? by ikegami (Patriarch) on Jul 10, 2009 at 16:02 UTC
Oops, I didn't mean to post so soon since I haven't figured out what's going on yet. I wanted to try this on 5.10 before doing anything else since this is obviously a bug. (The behaviour of `\w` shouldn't change based on whether it's in a char class or not.) Well, at least it provides more details about the problem and clear instructions on how to reproduce it. Could someone run this on Perl 5.10 for me? I don't have a unix box with that version handy.
Re^3: use locale behavior depends on charset of locale? by ig (Vicar) on Jul 10, 2009 at 17:04 UTC
Here are results from a default build of perl 5.10.0 on CentOS 5.3:

[ian@alula ~]$ perl -e'binmode STDOUT, ":encoding(UTF-8)"; print chr 0xc9' | LANG=en_CA.utf8 perl test.pl 0
unix:perlio:encoding(utf-8-strict):utf8
Input = "\x{c9}" [utf8]
Outside char class: match
Inside char class:  match

[ian@alula ~]$ perl -e'binmode STDOUT, ":encoding(UTF-8)"; print chr 0xc9' | LANG=en_CA.utf8 perl test.pl 1
unix:perlio:encoding(utf-8-strict):utf8
Input = "\x{c9}" [utf8]
Outside char class: no match
Inside char class:  match
Re^3: use locale behavior depends on charset of locale? by zwon (Abbot) on Jul 10, 2009 at 18:02 UTC
"The behaviour of `\w` shouldn't change based on whether it's in a char class or not."

But running with `-Dr` shows that these regexes are compiled differently. And it looks like the problem is not in IO. I've tried the following:

#use locale;
use utf8;
$a = 'é';
$a =~ m/\w/ and print "first re match\n";
$a =~ m/[\w]/ and print "second re match\n";

If `use locale` is commented out, it outputs:

first re match
second re match

If `use locale` is uncommented, it outputs only:

second re match

My locale is en_US.UTF-8, BTW. `debugperl -Dr` shows the following:

Without locale:
Compiling REx "\w"
Final program:
   1: ALNUM (2)
   2: END (0)
stclass ALNUM minlen 1
Compiling REx "[\w]"
Final program:
   1: ANYOF[0-9A-Z_a-z+utf8::IsWord] (13)
  13: END (0)
stclass ANYOF[0-9A-Z_a-z+utf8::IsWord] minlen 1

With locale:
Compiling REx "\w"
Final program:
   1: ALNUML (2)
   2: END (0)
stclass ALNUML minlen 1
Compiling REx "[\w]"
Final program:
   1: ANYOF{loc}[\w+utf8::IsWord] (13)
  13: END (0)
stclass ANYOF{loc}[\w+utf8::IsWord] minlen 1

I don't really understand this output very well, but it looks like it treats these regexes differently.
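For anyone without a debugging build of perl: `-Dr` needs a perl compiled with `-DDEBUGGING`, but the `re` pragma can produce similar compile-time listings on a regular build. A minimal sketch, assuming only the core `re` and `locale` pragmas (the variable names `$outside` and `$inside` are mine); the exact node names printed may differ between Perl versions, so they are not shown here.

```perl
use strict;
use warnings;
use locale;        # comment this out to compare against the non-locale case
use re 'debug';    # lexically scoped; regex debug info goes to STDERR

# Compiling each pattern prints its "Final program" listing.
my $outside = qr/\w/;
my $inside  = qr/[\w]/;
```

Running it once with `use locale` and once without, and comparing the two listings side by side, should show the same kind of difference as the `debugperl -Dr` output above.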
Re^4: use locale behavior depends on charset of locale? by ikegami (Patriarch) on Jul 10, 2009 at 18:35 UTC
"And it looks like the problem is not in IO"

Correct. That's what my changes show.

"I don't really understand this output very well, but it looks like it treats this regexes differently."

It's not a problem that `/\w/` and `/[\w]/` compile differently. It's a problem that they don't compile to something equivalent.

use locale;
utf8::upgrade( my $s = chr(0xC9) );  # e-acute
print "Outside char class: ", $s =~ m/\w/   ? "" : "no ", "match\n";
print "Inside char class:  ", $s =~ m/[\w]/ ? "" : "no ", "match\n";

LANG=en_CA.utf8 perl test.pl
Outside char class: no match
Inside char class:  match