use locale behavior depends on charset of locale?

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: use locale behavior depends on charset of locale? by ikegami (Patriarch) on Jul 10, 2009 at 15:58 UTC
A note before starting: IIRC, `use open` doesn't work too well with `ARGV` (`<>` is short for `<ARGV>`), but it works here since you end up reading from STDIN (not a file named on the command line). I'm having problems getting any locale except `en_CA.utf8` working, so I'll use that one.

First, I changed your program somewhat:

    use open qw/:locale/;
    BEGIN { require locale; locale->import() if shift; }
    use Data::Dumper qw( Dumper );

    sub dump_str {
        my ($s) = @_;
        my $internal_enc = utf8::is_utf8($s) ? "utf8" : "iso-latin-1";
        local $Data::Dumper::Useqq = 1;
        local $Data::Dumper::Terse = 1;
        local $Data::Dumper::Indent = 0;
        print("Input = ", Dumper($s), " [$internal_enc]\n");
    }

    print(join(':', PerlIO::get_layers(STDIN)), "\n");
    my $s = <>;
    dump_str($s);
    print "Outside char class: ", $s =~ m/\w/ ? "" : "no ", "match\n";
    print "Inside char class: ", $s =~ m/[\w]/ ? "" : "no ", "match\n";

Output:

    $ perl -e'binmode STDOUT, ":encoding(UTF-8)"; print chr 0xC9' | LANG=en_CA.utf8 perl a.pl 0
    unix:perlio:utf8
    Input = "\x{c9}" [utf8]
    Outside char class: match
    Inside char class: match

    $ perl -e'binmode STDOUT, ":encoding(UTF-8)"; print chr 0xC9' | LANG=en_CA.utf8 perl a.pl 1
    unix:perlio:utf8
    Input = "\x{c9}" [utf8]
    Outside char class: no match
    Inside char class: match

(5.8.8 on Debian)

The important addition is the display of the internal coding of the input. Match operations base some of their behaviour on the internal encoding of the string being matched.
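The point about internal encoding can be tried in isolation. The following is a minimal sketch, not part of the program above: it stores the same character (U+00C9) in both internal forms with `utf8::downgrade` and `utf8::upgrade`, then reports whether `\w` matches bare and inside a character class. The `describe` helper is a hypothetical name introduced for illustration, and the sketch deliberately runs without `use locale`, so it only shows that the internal form is something a match can depend on. The match results vary with the Perl version and enabled features, so none are claimed here.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the thread's program): report how Perl
# stores a string internally and whether \w matches it, both bare and
# inside a character class.
sub describe {
    my ($s) = @_;
    return +{
        internal => utf8::is_utf8($s) ? 'utf8' : 'bytes',
        bare     => ($s =~ /\w/   ? 1 : 0),
        class    => ($s =~ /[\w]/ ? 1 : 0),
    };
}

my $down = chr 0xC9;        # U+00C9, the character used in the thread
my $up   = $down;
utf8::downgrade($down);     # force the one-byte-per-character form
utf8::upgrade($up);         # force the UTF-8 internal form

for my $case ([downgraded => $down], [upgraded => $up]) {
    my ($label, $str) = @$case;
    my $r = describe($str);
    printf "%-10s internal=%-5s bare=%d class=%d\n",
        $label, @{$r}{qw(internal bare class)};
}

# Both variables hold the same one-character string as far as eq is concerned.
print "same string: ", ($down eq $up ? "yes" : "no"), "\n";
```

Adding `use locale` at the top of the sketch and running it under a UTF-8 locale should bring it closer to the case reported in this thread.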
Re^2: use locale behavior depends on charset of locale? by ikegami (Patriarch) on Jul 10, 2009 at 16:02 UTC
Oops, I didn't mean to post so soon since I haven't figured out what's going on yet. I wanted to try this on 5.10 before doing anything else since this is obviously a bug. (The behaviour of `\w` shouldn't change based on whether it's in a char class or not.) Well, at least it provides more details about the problem and clear instructions on how to reproduce it.

Could someone run this on Perl 5.10 for me? I don't have a unix box with that version handy.
Re^3: use locale behavior depends on charset of locale? by ig (Vicar) on Jul 10, 2009 at 17:04 UTC
Here are results from a default build of perl 5.10.0 on CentOS 5.3:

    [ian@alula ~]$ perl -e'binmode STDOUT, ":encoding(UTF-8)"; print chr 0xc9' | LANG=en_CA.utf8 perl test.pl 0
    unix:perlio:encoding(utf-8-strict):utf8
    Input = "\x{c9}" [utf8]
    Outside char class: match
    Inside char class: match

    [ian@alula ~]$ perl -e'binmode STDOUT, ":encoding(UTF-8)"; print chr 0xc9' | LANG=en_CA.utf8 perl test.pl 1
    unix:perlio:encoding(utf-8-strict):utf8
    Input = "\x{c9}" [utf8]
    Outside char class: no match
    Inside char class: match
Re^3: use locale behavior depends on charset of locale? by zwon (Abbot) on Jul 10, 2009 at 18:02 UTC
"The behaviour of `\w` shouldn't change based on whether it's in a char class or not."

But running with `-Dr` shows that these regexes are compiled differently. And it looks like the problem is not in IO. I've tried the following:

    #use locale;
    use utf8;
    $a = 'é';
    $a =~ m/\w/ and print "first re match\n";
    $a =~ m/[\w]/ and print "second re match\n";

If `use locale` is commented out, it outputs:

    first re match
    second re match

If `use locale` is uncommented, it outputs only:

    second re match

My locale is en_US.UTF-8, BTW. `debugperl -Dr` shows the following:

    Without locale:
    Compiling REx "\w"
    Final program:
       1: ALNUM (2)
       2: END (0)
    stclass ALNUM minlen 1
    Compiling REx "[\w]"
    Final program:
       1: ANYOF[0-9A-Z_a-z+utf8::IsWord] (13)
      13: END (0)
    stclass ANYOF[0-9A-Z_a-z+utf8::IsWord] minlen 1

    with locale:
    Compiling REx "\w"
    Final program:
       1: ALNUML (2)
       2: END (0)
    stclass ALNUML minlen 1
    Compiling REx "[\w]"
    Final program:
       1: ANYOF{loc}[\w+utf8::IsWord] (13)
      13: END (0)
    stclass ANYOF{loc}[\w+utf8::IsWord] minlen 1

I don't really understand this output very well, but it looks like it treats these regexes differently.
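For anyone without a debugging build of perl, the `re` pragma offers a way to look at compiled regex programs. This is a minimal sketch under that assumption: `use re 'debug'` makes the regex engine print the compiled program to STDERR for patterns compiled in its scope. The two `compile_*` subs are hypothetical names introduced for illustration, and the node names printed differ between Perl versions, so no output is shown.

```perl
use strict;
use warnings;

# Compile each pattern inside a scope where `use re 'debug'` is active, so
# the regex engine prints the compiled program to STDERR.
sub compile_without_locale {
    use re 'debug';
    return (qr/\w/, qr/[\w]/);
}

# Same two patterns, but compiled with `use locale` in effect.
sub compile_with_locale {
    use locale;
    use re 'debug';
    return (qr/\w/, qr/[\w]/);
}

my @plain  = compile_without_locale();
my @locale = compile_with_locale();

print scalar(@plain) + scalar(@locale),
    " patterns compiled; see STDERR for the compiled programs\n";
```

Comparing the STDERR output for the two subs should show whether the bare `\w` and the `[\w]` class compile to different node types under `use locale`, as in the `-Dr` listing above.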
Re^4: use locale behavior depends on charset of locale? by ikegami (Patriarch) on Jul 10, 2009 at 18:35 UTC
Re: use locale behavior depends on charset of locale? by Anonymous Monk on Jul 10, 2009 at 13:39 UTC
Check `print join( ':', PerlIO::get_layers(STDIN)), "\n";`
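To see what that check reports when a decoding layer is present, here is a minimal sketch that uses an in-memory filehandle instead of STDIN (an assumption made so the example runs without piped input). It prints the layers before and after `binmode` adds `:encoding(UTF-8)`, then reads one line. The `layers_of` helper is a hypothetical name introduced for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: join a handle's PerlIO layers the same way the
# one-liner above does for STDIN.
sub layers_of {
    my ($fh) = @_;
    return join ':', PerlIO::get_layers($fh);
}

my $bytes = "\xC3\x89\n";    # U+00C9 encoded as UTF-8, plus a newline
open my $fh, '<', \$bytes or die "cannot open in-memory handle: $!";

print "before: ", layers_of($fh), "\n";
binmode $fh, ':encoding(UTF-8)' or die "binmode failed: $!";
print "after:  ", layers_of($fh), "\n";

# With the decoding layer in place, the two input bytes arrive as one character.
my $line = <$fh>;
chomp $line;
printf "read %d character(s), first is U+%04X\n", length($line), ord($line);
```

If the layer list for STDIN shows no `utf8` or `encoding(...)` entry, the input is being read as raw bytes, which changes what the regex sees.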