ewaters has asked for the wisdom of the Perl Monks concerning the following question:

Hi there,

I'm using Text::CharWidth (recommended from http://www.perlmonks.org/?node_id=713326) to determine the display width of UTF-8 strings in a module I just released (https://metacpan.org/release/Text-UnicodeBox). I'm running into an issue with *BSD boxes and I believe I've narrowed it down to the environment. Take this simple test script:

use Test::More;
use Text::CharWidth qw(:all);
my $string = " \x{8c61}\x{5f62}\x{6587}\x{5b57}\x{8c61}\x{5f62}\x{6587}\x{5b57} ";
ok utf8::valid($string), "is valid utf8";
is length($string), 10, "There are 10 bytes";
is mbswidth($string), 18, "CharWidth of kanji";
done_testing;

This works just fine so long as your LC_* environment variables are configured. The issue, though, is that apparently all the CPAN testers out there don't have the locale set when they run this, which breaks CharWidth (Text::CharWidth uses the system function wcwidth, which uses your locale settings - http://www.manpagez.com/man/3/wcwidth/). Here's an example of it breaking:

$ LC_ALL= prove -l -v t/simple_unicode_test.t
t/simple_unicode_test.t ..
ok 1 - is valid utf8
ok 2 - There are 10 bytes
not ok 3 - CharWidth of kanji
1..3
# Failed test 'CharWidth of kanji'
# at t/simple_unicode_test.t line 9.
# got: '-22'
# expected: '18'
# Looks like you failed 1 test of 3.
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/3 subtests
Test Summary Report
-------------------
t/simple_unicode_test.t (Wstat: 256 Tests: 3 Failed: 1)
  Failed test: 3
  Non-zero exit status: 1
Files=1, Tests=3, 0 wallclock secs ( 0.02 usr 0.03 sys + 0.03 cusr 0.01 csys = 0.09 CPU)
Result: FAIL

This is simplified, of course. Here's one of the actual failing test reports.

And another example (from http://www.perlmonks.org/?node_id=713326) of how LC_ALL= can mess up CharWidth:

$ perl -C63 -MDevel::Peek -Mutf8 -mText::CharWidth=mbswidth -le '$_="(\x{5fcd} Guimarães)"; Dump($_); print mbswidth($_);'
SV = PV(0x8101064) at 0x8100118
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x814b740 "(\345\277\215 Guimar\303\243es)"\0 [UTF8 "(\x{5fcd} Guimar\x{e3}es)"]
  CUR = 16
  LEN = 20
14
$ LC_ALL= perl -C63 -MDevel::Peek -Mutf8 -mText::CharWidth=mbswidth -le '$_="(\x{5fcd} Guimarães)"; Dump($_); print mbswidth($_);'
SV = PV(0x8101064) at 0x8100118
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x814b740 "(\345\277\215 Guimar\303\243es)"\0 [UTF8 "(\x{5fcd} Guimar\x{e3}es)"]
  CUR = 16
  LEN = 20
6

So, how do I get my code to pass CPAN tests on *BSD boxes without LC_* set? I've tried adding it to the tests but to no avail.
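
When chasing this kind of failure it can help to have the test report which locale the process is actually running under. The following is only a rough sketch using the core POSIX module; the helper name current_ctype_locale is made up for illustration and is not part of Text::CharWidth.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);

# Hypothetical helper: report the character-type locale currently in effect.
# Calling setlocale() with only a category queries the locale without changing it.
sub current_ctype_locale {
    my $name = setlocale(LC_CTYPE);
    return defined $name ? $name : '(unknown)';
}

print "LC_CTYPE locale: ", current_ctype_locale(), "\n";
```

In a test file the same value could be passed to Test::More's diag so that it shows up in CPAN tester reports.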

Replies are listed 'Best First'.
Re: Determining the length of a unicode string without LC_* set
by sauoq (Abbot) on May 08, 2012 at 04:15 UTC
    So, how do I get my code to pass CPAN tests on *BSD boxes without LC_* set?

    Set them.

    $ENV{LC_ALL} = 'en_US.utf8'; # Or whatever.
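
One caveat: the C library locale is chosen when setlocale is called, so changing %ENV in a perl that is already running may have no effect on wcwidth unless setlocale is called again. Locale names also differ between systems, so a given spelling may not exist on a particular BSD box. Below is a minimal sketch that sets LC_CTYPE directly with the core POSIX module. The helper name and the candidate locale names are assumptions, and it presumes Text::CharWidth picks up the process locale.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);

# Hypothetical helper: try each candidate locale name in turn and return the
# first one the C library accepts for LC_CTYPE, or undef if none can be set.
sub use_first_available_locale {
    my @candidates = @_;
    for my $name (@candidates) {
        # setlocale() returns a false value when the locale cannot be set.
        return $name if setlocale(LC_CTYPE, $name);
    }
    return;
}

# These names are assumptions; the spelling varies from system to system.
my $locale = use_first_available_locale('en_US.UTF-8', 'en_US.utf8', 'C.UTF-8');
print defined $locale
    ? "Using locale $locale\n"
    : "No UTF-8 locale found; width tests should probably be skipped\n";
```

In a test file, an undef result could be turned into plan skip_all => '...' from Test::More, so machines without a UTF-8 locale report a skip instead of a failure.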

    -sauoq
    "My two cents aren't worth a dime.";
Re: Determining the length of a unicode string without LC_* set
by ikegami (Patriarch) on May 08, 2012 at 03:57 UTC
    Why would you want your tests to pass on a system where your code doesn't work?

      There's a line somewhere between a system and its configuration. He can either fix his code to do something reasonable when the envariables he needs aren't set right, or he can fix his tests and document that the envariables need to be set right.

      -sauoq
      "My two cents aren't worth a dime.";
Re: Determining the length of a unicode string without LC_* set
by raybies (Chaplain) on May 08, 2012 at 12:07 UTC
    This may be a pointless aside, but when you say "There are 10 bytes" don't you mean that the string has ten utf characters? I don't know utf, but those hex codes (\x{8c61}) are sixteen bits or two bytes each, right? (Just curious, as I've never worked with utf-8...)
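
For what it's worth, the difference between characters and bytes is easy to check with the core Encode module. This is just a small sketch using the string from the question: length on a character string counts characters, and each of these kanji takes three bytes once encoded as UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The string from the question: a space, eight kanji, and a space.
my $string = " \x{8c61}\x{5f62}\x{6587}\x{5b57}\x{8c61}\x{5f62}\x{6587}\x{5b57} ";

# length() on a character string counts characters (code points).
my $chars = length($string);

# Encoding to UTF-8 gives a byte string, so length() now counts bytes.
my $bytes = length(encode('UTF-8', $string));

print "characters: $chars, UTF-8 bytes: $bytes\n";   # characters: 10, UTF-8 bytes: 26
```

So the test description "There are 10 bytes" is really counting characters; the UTF-8 encoding of the same string is 26 bytes long.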