ropey has asked for the wisdom of the Perl Monks concerning the following question:

I just can't seem to get my head around locales, even after looking through numerous posts...

So I have a multilingual application, and I want to match names based on the locale of the language (well, at least it's only Western European

...

So I would expect this to work

use POSIX 'locale_h';
# German locale, for example. Run 'locale -a' to get the exact locale name.
my $loc = 'de_DE.UTF-8';
setlocale(LC_CTYPE, $loc)
my $suspect = "vähicule";
if ($suspect =~ /^\w+$/) {
    print STDERR "MATCHES\n";
}

I would expect the regex to match, but it doesn't. I am missing something...

...

Is there a better way of doing such things? All I need to do is check whether an input matches according to the language set...

Replies are listed 'Best First'.
Re: Locale Woes...
by fenLisesi (Priest) on Jun 04, 2007 at 11:15 UTC
    Something like the following may help (encode_monk and decode_monk are only there to save you potential character set problems with your text editor -- don't let them distract you). Cheers.
    use strict;
    use warnings;
    use Encode 'decode_utf8';

    my $suspect = decode_utf8( decode_monk('v~C3~A4hicule') );
    warn '$suspect='.$suspect;
    if ($suspect =~ /^([\p{IsAlnum}]+)$/) {
        warn "MATCHES $1\n";
    }
    exit( 0 );

    ##------------------------------------------------------------------+
    ## WARNING: Using sprintf here is too expensive. A
    ## lookup table may be the best solution.
    sub encode_monk {
        my ($txt) = @_;
        join q(), map {
            ( $_ >= 0x20 && $_ <= 0x7D )
                ? ( sprintf "%c", $_ )
                : ( sprintf "~%02X", $_ )
        } unpack "C*", $txt;
    }

    ##------------------------------------------------------------------+
    ## WARNING: Consider err policy here.
    sub decode_monk {
        my ($code) = @_;
        if (! defined $code or length( $code ) == 0) {
            return;
        }
        $code =~ s/\~([\da-fA-F]{2})/chr( hex( $1 ) )/eg;
        return $code;
    }
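The warning on encode_monk above suggests a lookup table in place of sprintf. Here is a minimal sketch of that idea, assuming the same escaping rule (bytes 0x20 to 0x7D pass through, everything else becomes ~XX). The names @MONK_TABLE and encode_monk_lut are hypothetical and do not come from fenLisesi's post.

```perl
use strict;
use warnings;

# Hypothetical lookup table: one precomputed output string per byte value.
# Same rule as encode_monk: 0x20..0x7D pass through, the rest become ~XX.
my @MONK_TABLE = map {
    ( $_ >= 0x20 && $_ <= 0x7D ) ? chr($_) : sprintf( "~%02X", $_ )
} 0 .. 255;

# Hypothetical name; expects a byte string (no code points above 0xFF).
sub encode_monk_lut {
    my ($txt) = @_;
    # An array slice replaces the per-byte sprintf call.
    return join q(), @MONK_TABLE[ unpack "C*", $txt ];
}

print encode_monk_lut("v\xC3\xA4hicule"), "\n";    # prints v~C3~A4hicule
```

The table is built once with sprintf, so the per-call cost is only an unpack, a slice, and a join. Whether that is measurably faster for short strings is something to benchmark rather than assume.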
Re: Locale Woes...
by graff (Chancellor) on Jun 04, 2007 at 12:21 UTC
    You seem to be taking it for granted that the file containing your perl script is utf8-encoded, so that the "ä" character is stored in the script file as a two-byte sequence (0xC3 0xA4). You should use some hex-dump tool to confirm this (unix commands "od" or "xxd", or some other perl script that uses binmode).

    Then, once you know the file really is stored with utf8 encoding, you need to add use utf8; at the top of the script. That tells the perl interpreter that string literals in the script should have the utf8-flag turned on (and are expected to contain wide characters).

    The following version of your code works for me (v5.8.6 built for darwin-thread-multi-2level) -- note that your script was missing a semicolon on the "setlocale" line, but in fact, you don't even need to worry about locale in order for the match to work.

    use utf8;
    my $suspect = "vähicule";
    if ($suspect =~ /^\w+$/) {
        print STDERR "MATCHES\n";
    }
    If you'll be doing something else in the real script that requires locale settings, that's a separate issue. Regex matches and other character-based operations depend only on the utf8 flag settings of scalars, not on locale.
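To make graff's point about bytes versus characters concrete, here is a small sketch that runs the same \w test on the raw UTF-8 bytes and on the decoded string. It assumes Perl's default regex semantics (no "use locale", no unicode_strings feature). The helper name is_word is hypothetical, and the bytes are written as escapes so the result does not depend on how the script file is saved.

```perl
use strict;
use warnings;
use Encode 'decode_utf8';

# Hypothetical helper: does the whole string consist of \w characters?
sub is_word {
    my ($s) = @_;
    return $s =~ /^\w+$/ ? 1 : 0;
}

# The UTF-8 encoded bytes of "vähicule" (ä is the byte pair C3 A4).
my $bytes = "v\xC3\xA4hicule";

# Decoding turns the byte pair into the single character U+00E4.
my $chars = decode_utf8($bytes);

printf "bytes: length %d, all \\w? %s\n", length($bytes), is_word($bytes) ? "yes" : "no";
printf "chars: length %d, all \\w? %s\n", length($chars), is_word($chars) ? "yes" : "no";
```

The undecoded string is 9 bytes long and fails the match, while the decoded one is 8 characters long and passes. That is the same effect "use utf8" has on a literal in a UTF-8 encoded script.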
Re: Locale Woes...
by shmem (Chancellor) on Jun 04, 2007 at 12:25 UTC
    You don't want LC_CTYPE, you want LC_COLLATE for that:
    use locale;
    use POSIX qw(locale_h);
    setlocale(LC_COLLATE,"de_DE") or die "foo? - $!\n";
    print "yup" if "fähler"=~/^\w+$/;
    __END__
    yup

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
      I can't get shmem's solution to work unless I add use utf8 and operate on LC_CTYPE. Based on the Location, one might guess that shmem's LC_CTYPE is already 'de_DE'?
      $ perl -l
      use strict;
      use warnings;
      use locale;
      use POSIX qw(locale_h);
      print setlocale(LC_COLLATE,"de_DE") or die "foo? - $!\n";
      ("fähler"=~/^\w+$/) ? print "yup" : print "nope";
      __END__
      de_DE
      nope

      $ perl -l
      use strict;
      use warnings;
      use utf8;
      use locale;
      use POSIX qw(locale_h);
      print setlocale(LC_COLLATE,"de_DE") or die "foo? - $!\n";
      ("fähler"=~/^\w+$/) ? print "yup" : print "nope";
      __END__
      de_DE
      nope

      $ perl -l
      use strict;
      use warnings;
      use utf8;
      use locale;
      use POSIX qw(locale_h);
      print setlocale(LC_CTYPE,"de_DE") or die "foo? - $!\n";
      ("fähler"=~/^\w+$/) ? print "yup" : print "nope";
      __END__
      de_DE
      yup
      --
      print map{chr}unpack(q{A3}x24,q{074117115116032097110111116104101114032080101114108032104097099107101114})
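The runs above depend on which locale names the system actually accepts, so it can help to check what setlocale returns before relying on it. This is a rough sketch, assuming POSIX::setlocale returns the locale name on success and undef on failure. The helper name first_ctype_locale is hypothetical, and the candidate list is only an example built from the names used in this thread.

```perl
use strict;
use warnings;
use POSIX qw(locale_h);

# Hypothetical helper: try each candidate name for LC_CTYPE and return the
# first one the system accepts, or undef if none of them is accepted.
sub first_ctype_locale {
    my @candidates = @_;
    for my $name (@candidates) {
        my $got = setlocale( LC_CTYPE, $name );
        return $got if defined $got;
    }
    return;
}

# 'de_DE.UTF-8' and 'de_DE' are the names used in this thread; "C" is the
# portable fallback that POSIX systems provide.
my $active = first_ctype_locale( 'de_DE.UTF-8', 'de_DE', 'C' );
print defined $active
    ? "LC_CTYPE is now $active\n"
    : "no candidate locale was accepted\n";
```

The sketch only selects a locale. Whether \w then follows it still depends on "use locale" being in scope and on how the string is encoded, as the three runs above show.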