TowerGuard has asked for the wisdom of the Perl Monks concerning the following question:

I've encountered an issue in my perl script where the locale of the box I'm running on affects the length() calculation of strings with multibyte characters. The current locale setting of the box is en_US.UTF-8. If I "export LC_CTYPE=en_US" at the UNIX prompt and then run the script, the length() function does what I want it to. However, I want the script to set the locale dynamically. I found the following code from a site online but none of this seems to have the effect that it should. For instance I use:
    use locale;
    use POSIX qw(locale_h);
    $oldlocale = setlocale(LC_CTYPE);
    $newlocale = setlocale(LC_CTYPE, "en_US");
The setlocale function seems to set the locale correctly as when I print $oldlocale and $newlocale I get en_US.UTF-8 and en_US respectively.

However my length function still treats the multibyte characters in an undesired way.

How come when I change the locale at the UNIX prompt it works as desired but when I change it in the script, it does not?

Thanks,
Steve

Re: setlocale not working properly in perl script
by ikegami (Patriarch) on Jun 12, 2008 at 16:18 UTC
    I don't see this effect you mentioned.
    $ LC_CTYPE='en_US.UTF-8' perl -Mlocale -le'my $ch = chr(0x2660); print length($ch);'
    1
    $ LC_CTYPE='en_US' perl -Mlocale -le'my $ch = chr(0x2660); print length($ch);'
    1
    $ LC_CTYPE='en_US.UTF-8' perl -Mlocale -MEncode -le'my $ch = chr(0x2660); my $bytes = encode("utf-8", $ch); print length($bytes);'
    3
    $ LC_CTYPE='en_US' perl -Mlocale -MEncode -le'my $ch = chr(0x2660); my $bytes = encode("utf-8", $ch); print length($bytes);'
    3

    Could you provide a suitable test? I'm not all that familiar with locales.

Re: setlocale not working properly in perl script
by ikegami (Patriarch) on Jun 12, 2008 at 15:54 UTC

    Neither the documentation for length nor perllocale says that LC_CTYPE affects length.

    What are you trying to do? What's the desired way that length should handle your characters?
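
A minimal sketch of an in-script check, assuming the goal is to see whether calling setlocale changes what length returns for a character string. The "C" locale is used here only because it exists on every system; it stands in for the en_US locale from the question. The variable names are illustrative.

```perl
use strict;
use warnings;
use locale;
use POSIX qw(locale_h);

# One character above U+00FF, so Perl stores it with the UTF8 flag on.
my $ch = chr(0x2660);

# Length before touching the locale.
my $before = length($ch);

# Switch LC_CTYPE from inside the script (hypothetical stand-in for "en_US").
my $new_locale = setlocale(LC_CTYPE, "C");

# Length after the switch.
my $after = length($ch);

printf "before=%d after=%d\n", $before, $after;
```

Both values come out the same because length counts characters of the string it is given, which matches the one-liners run with different LC_CTYPE values.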

      length is defined to return the number of characters in a string, not necessarily the number of bytes (i.e., when dealing with multibyte character strings). The locale environment variable affects how the length function views the string. In en_US.UTF-8, length treats each multibyte character as one character and returns 1 even though the character takes up 3 bytes. In en_US, length treats each byte as a separate character, so it returns 3. I want the latter result (the number of bytes in the string). I am experimenting with bytes::length to get the same result regardless of locale, but I still find it strange that setlocale has no effect when called from within the Perl script.

        The locale environment variable affects how the length function views the string.

        I don't see where this is documented, I have been unable to reproduce this, and you didn't respond to my request to demonstrate this. I have nothing to go on.

        On the other hand, I do know how Perl handles strings in the absence of use locale;. For the rest of this post, I assume use locale; is not in effect.

        The UTF8 flag (and only the UTF8 flag) affects how length views a string. In the following, notice that length returns different values even though both variables have the same internal representation ("\303\251"). The difference is that the first has the UTF8 flag on.

        $ perl -MDevel::Peek -le'utf8::upgrade(my $x = chr(0xE9)); print length($x); Dump($x)'
        1
        SV = PVMG(0x817fd40) at 0x814cc6c
          REFCNT = 1
          FLAGS = (PADBUSY,PADMY,SMG,POK,pPOK,UTF8)
          IV = 0
          NV = 0
          PV = 0x815bda0 "\303\251"\0 [UTF8 "\x{e9}"]
          CUR = 2
          LEN = 3
          MAGIC = 0x8163060
            MG_VIRTUAL = &PL_vtbl_utf8
            MG_TYPE = PERL_MAGIC_utf8(w)
            MG_LEN = 1

        $ perl -MDevel::Peek -le'utf8::encode(my $x = chr(0xE9)); print length($x); Dump($x)'
        2
        SV = PV(0x814ce90) at 0x814cc6c
          REFCNT = 1
          FLAGS = (PADBUSY,PADMY,POK,pPOK)
          PV = 0x815bda0 "\303\251"\0
          CUR = 2
          LEN = 3

        The string is represented internally as UTF8 (UTF8 flag on) when bytes are decoded into characters. This can be done on scalars (utf8::decode or Encode::decode) or on file handles (using the :encoding PerlIO layer).
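
The three decoding routes named above can be sketched as follows. This is an illustration only: the sample bytes (the UTF-8 encoding of U+2660) and all variable names are assumptions, and an in-memory file handle stands in for a real file.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Three raw bytes: the UTF-8 encoding of U+2660. No UTF8 flag yet.
my $bytes     = "\xE2\x99\xA0";
my $len_bytes = length($bytes);

# Route 1: Encode::decode returns a new character string.
my $via_encode     = decode('UTF-8', $bytes);
my $len_via_encode = length($via_encode);

# Route 2: utf8::decode converts a scalar in place, so work on a copy.
my $copy = $bytes;
utf8::decode($copy);
my $len_via_utf8 = length($copy);

# Route 3: an :encoding layer decodes as data is read from a handle.
my $source = $bytes;
open(my $fh, '<:encoding(UTF-8)', \$source) or die "open failed: $!";
my $via_layer = <$fh>;
close($fh);
my $len_via_layer = length($via_layer);

printf "bytes=%d decode=%d utf8::decode=%d layer=%d\n",
    $len_bytes, $len_via_encode, $len_via_utf8, $len_via_layer;
```

In each route the undecoded data has length 3 and the decoded string has length 1, independent of the LC_CTYPE setting.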

Re: setlocale not working properly in perl script
by TowerGuard (Initiate) on Jun 12, 2008 at 18:30 UTC
    length is defined to return the number of characters in a string, not necessarily the number of bytes (i.e., when dealing with multibyte character strings). The locale environment variable affects how the length function views the string. In en_US.UTF-8, length treats each multibyte character as one character and returns 1 even though the character takes up 3 bytes. In en_US, length treats each byte as a separate character, so it returns 3. I want the latter result (the number of bytes in the string). I am experimenting with bytes::length to get the same result regardless of locale, but I still find it strange that setlocale has no effect when called from within the Perl script.
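
If the goal is the number of bytes a string occupies when written out as UTF-8, one locale-independent approach is to encode explicitly and measure the result. The sketch below is a hedged illustration, not the thread's agreed solution: utf8_byte_length is a hypothetical helper name, and the comparison with bytes::length is included only to show where the two can disagree.

```perl
use strict;
use warnings;
use Encode qw(encode);
use bytes ();    # load bytes::length without turning on byte semantics

# Hypothetical helper: number of bytes in the UTF-8 encoding of a string.
sub utf8_byte_length {
    my ($string) = @_;
    return length(encode('UTF-8', $string));
}

my $spade   = chr(0x2660);    # stored internally with the UTF8 flag on
my $e_acute = chr(0xE9);      # stored internally as a single byte, flag off

my $spade_encoded  = utf8_byte_length($spade);
my $spade_internal = bytes::length($spade);
my $acute_encoded  = utf8_byte_length($e_acute);
my $acute_internal = bytes::length($e_acute);

printf "U+2660: encoded=%d internal=%d\n", $spade_encoded, $spade_internal;
printf "U+00E9: encoded=%d internal=%d\n", $acute_encoded, $acute_internal;
```

bytes::length reports the size of Perl's internal buffer, which depends on the UTF8 flag rather than on the output encoding, so the two measures differ for U+00E9 here. Encoding explicitly ties the count to the bytes that will actually be written.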