Determining the length of a unicode string without LC_* set

ewaters has asked for the wisdom of the Perl Monks concerning the following question:

Hi there,

I'm using Text::CharWidth (recommended from http://www.perlmonks.org/?node_id=713326) to determine the display width of UTF-8 strings in a module I just released (https://metacpan.org/release/Text-UnicodeBox). I'm running into an issue with *BSD boxes and I believe I've narrowed it down to the environment. Take this simple test script:

use Test::More;
use Text::CharWidth qw(:all);

my $string = " \x{8c61}\x{5f62}\x{6587}\x{5b57}\x{8c61}\x{5f62}\x{6587}\x{5b57} ";
ok utf8::valid($string), "is valid utf8";
is length($string), 10, "There are 10 bytes";
is mbswidth($string), 18, "CharWidth of kanji";

done_testing;

This works just fine so long as your LC_* environment variables are configured. The issue, though, is that apparently all the CPAN testers out there don't have the locale set when they run this, which breaks CharWidth (Text::CharWidth uses the system function wcwidth, which uses your locale settings - http://www.manpagez.com/man/3/wcwidth/). Here's an example of it breaking:

$ LC_ALL= prove -l -v t/simple_unicode_test.t 
t/simple_unicode_test.t .. 
ok 1 - is valid utf8
ok 2 - There are 10 bytes
not ok 3 - CharWidth of kanji
1..3

#   Failed test 'CharWidth of kanji'
#   at t/simple_unicode_test.t line 9.
#          got: '-22'
#     expected: '18'
# Looks like you failed 1 test of 3.
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/3 subtests 

Test Summary Report
-------------------
t/simple_unicode_test.t (Wstat: 256 Tests: 3 Failed: 1)
  Failed test:  3
  Non-zero exit status: 1
Files=1, Tests=3,  0 wallclock secs ( 0.02 usr  0.03 sys +  0.03 cusr  0.01 csys =  0.09 CPU)
Result: FAIL

This is simplified, of course. Here's one of the actual failing test reports

And another example (from http://www.perlmonks.org/?node_id=713326) of how LC_ALL= can mess up CharWidth:

$ perl -C63 -MDevel::Peek -Mutf8 -mText::CharWidth=mbswidth -le '$_="(\x{5fcd} Guimarães)"; Dump($_); print mbswidth($_);'

SV = PV(0x8101064) at 0x8100118
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x814b740 "(\345\277\215 Guimar\303\243es)"\0 [UTF8 "(\x{5fcd} Guimar\x{e3}es)"]
  CUR = 16
  LEN = 20
14

$ LC_ALL= perl -C63 -MDevel::Peek -Mutf8 -mText::CharWidth=mbswidth -le '$_="(\x{5fcd} Guimarães)"; Dump($_); print mbswidth($_);'

SV = PV(0x8101064) at 0x8100118
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x814b740 "(\345\277\215 Guimar\303\243es)"\0 [UTF8 "(\x{5fcd} Guimar\x{e3}es)"]
  CUR = 16
  LEN = 20
6

So, how do I get my code to pass CPAN tests on *BSD boxes without LC_* set? I've tried adding it to the tests but to no avail.


Re: Determining the length of a unicode string without LC_* set by sauoq (Abbot) on May 08, 2012 at 04:15 UTC
"So, how do I get my code to pass CPAN tests on *BSD boxes without LC_* set?" Set them. `$ENV{LC_ALL} = 'en_US.utf8'; # Or whatever.` `-sauoq "My two cents aren't worth a dime.";`
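A note on the suggestion above: assigning to `$ENV{LC_ALL}` inside a running script does not by itself change the C library's locale, which may be why setting it in the tests had no effect. A minimal sketch, assuming Text::CharWidth relies on the process's current LC_CTYPE, is to call `POSIX::setlocale` explicitly. The helper name `pick_utf8_locale` and the candidate locale names are illustrative assumptions; which locales are installed varies from system to system.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);

# Hypothetical helper (not part of any module): try each candidate locale
# name in turn and return the first one the C library accepts.
sub pick_utf8_locale {
    my @candidates = @_;
    for my $name (@candidates) {
        # setlocale returns the new locale name, or undef if it was rejected
        my $got = setlocale(LC_CTYPE, $name);
        return $got if defined $got;
    }
    return;    # nothing usable is installed
}

# Candidate names are assumptions; spelling differs between systems.
my $locale = pick_utf8_locale(qw(en_US.UTF-8 en_US.utf8 C.UTF-8));
print defined $locale
    ? "LC_CTYPE is now $locale\n"
    : "no UTF-8 locale available\n";
```

In a test file, one option is `plan skip_all => 'no UTF-8 locale available'` from Test::More when the helper returns nothing, so testers without such a locale report a skip rather than a failure.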
Re: Determining the length of a unicode string without LC_* set by ikegami (Patriarch) on May 08, 2012 at 03:57 UTC
Why would you want your tests to pass on a system where your code doesn't work?
Re^2: Determining the length of a unicode string without LC_* set by sauoq (Abbot) on May 08, 2012 at 04:21 UTC
There's a line somewhere between a system and its configuration. He can either fix his code to do something reasonable when the envariables he needs aren't set right, or he can fix his tests and document that the envariables need to be set right. `-sauoq "My two cents aren't worth a dime.";`
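As one possible reading of "do something reasonable", the width could be estimated from Perl's own Unicode tables, which do not consult the locale at all. This is only a rough sketch and not how Text::CharWidth or wcwidth works: `approx_width` is a hypothetical helper that counts East Asian Wide and Fullwidth characters as two columns, non-spacing and enclosing marks as zero, and everything else as one. It ignores control characters and ambiguous-width characters.

```perl
use strict;
use warnings;

# Hypothetical helper: locale-independent estimate of display width,
# based only on Unicode properties built into Perl.
sub approx_width {
    my ($str) = @_;
    my $width = 0;
    for my $ch (split //, $str) {
        if ($ch =~ /[\p{Mn}\p{Me}]/) {
            next;                # combining marks occupy no column
        }
        elsif ($ch =~ /[\p{East_Asian_Width=Wide}\p{East_Asian_Width=Fullwidth}]/) {
            $width += 2;         # CJK ideographs, fullwidth forms, etc.
        }
        else {
            $width += 1;         # everything else: assume one column
        }
    }
    return $width;
}

# The two strings from the original post
print approx_width(" \x{8c61}\x{5f62}\x{6587}\x{5b57}\x{8c61}\x{5f62}\x{6587}\x{5b57} "), "\n";  # 18
print approx_width("(\x{5fcd} Guimar\x{e3}es)"), "\n";                                            # 14
```

Both results match the widths mbswidth reports in the post when a UTF-8 locale is set, but terminals and wcwidth implementations can disagree on less common characters.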
Re: Determining the length of a unicode string without LC_* set by raybies (Chaplain) on May 08, 2012 at 12:07 UTC
This may be a pointless aside, but when you say "There are 10 bytes" don't you mean that the string has ten utf characters? I don't know utf, but those hex codes (\x{8c61}) are sixteen bits or two bytes each, right? (Just curious, as I've never worked with utf-8...)
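On the bytes question, a short sketch may help. Perl's `length` counts characters (code points) in a decoded string, so the test string has 10 of them. The byte count depends on the encoding: in UTF-8, each code point from U+0800 through U+FFFF takes three bytes, not two, so the eight kanji plus two ASCII spaces come to 26 bytes. The variable names here are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Same string as in the original test script
my $string = " \x{8c61}\x{5f62}\x{6587}\x{5b57}\x{8c61}\x{5f62}\x{6587}\x{5b57} ";

my $chars = length $string;                     # counts code points
my $bytes = length encode('UTF-8', $string);    # counts bytes after encoding

print "$chars characters, $bytes bytes as UTF-8\n";   # 10 characters, 26 bytes as UTF-8
```

So the test description "There are 10 bytes" would be more accurately worded as "There are 10 characters".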