tyatpi has asked for the wisdom of the Perl Monks concerning the following question:

Good evening Monks, I have wrestled with the following subroutine, but I cannot understand why the ucfirst function does not work. In a file called util.pm, there is a subroutine:

    sub beautify {
        my ($in) = @_;
        my $tmp;
        foreach (split(/\s/o, lc($in))) {
            $tmp .= ucfirst($_);
            $tmp .= ' ';
        }
        $tmp =~ s/\s$//;
        return($tmp);
    }

And now I test it with:

    use util;
    use Test::More tests => 34;
    is( &util::beautify( "àisTheWord"), "Àistheword", "àisTheWord - special character changes case." );
    ...
    is( &util::beautify( "ùisTheWord"), "Ùistheword", "ùisTheWord - special character changes case." );
    is( &util::beautify( "ûisTheWord"), "Ûistheword", "ûisTheWord - special character changes case." );
    is( &util::beautify( "üisTheWord"), "Üistheword", "üisTheWord - special character changes case." );
    is( &util::beautify( "ÿisTheWord"), "Ÿistheword", "ÿisTheWord - special character changes case." );
    ...

They are all failing. I've tried with:
use utf8; use Encode;
and all subsets of that pair, but to no avail. The system is perl v5.8.8, and when I run the tests on my v5.10 home laptop without any modifications to the original util.pm file, all tests pass! Any insights on how to debug would be very very helpful.
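
One way to start debugging is to check whether the string reaching ucfirst is a byte string or a decoded character string. The following is only a minimal sketch, under the assumption that the test file is saved as UTF-8 and read without `use utf8`; the `$bytes` and `$chars` variables are hypothetical and not part of util.pm.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: the raw UTF-8 bytes of "àisTheWord", which is what a
# literal in a UTF-8 source file looks like to perl without "use utf8".
my $bytes = "\xC3\xA0isTheWord";

# Decoding turns the byte pair C3 A0 into the single character U+00E0.
my $chars = decode('UTF-8', $bytes);

printf "bytes: length=%d utf8-flag=%s\n",
    length($bytes), utf8::is_utf8($bytes) ? 'on' : 'off';
printf "chars: length=%d utf8-flag=%s\n",
    length($chars), utf8::is_utf8($chars) ? 'on' : 'off';

# On the byte string, ucfirst only looks at the lead byte 0xC3,
# so the visible "à" is never touched.
printf "ucfirst on bytes changed the string: %s\n",
    ucfirst($bytes) eq $bytes ? 'no' : 'yes';

# On the decoded string, ucfirst sees U+00E0 and maps it to U+00C0.
printf "first code point after ucfirst on chars: U+%04X\n",
    ord(ucfirst($chars));
```

If the two lengths differ, the data is arriving as undecoded bytes, and the fix belongs in how the input is read or declared rather than in beautify itself.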

Re: UTF8 - ucfirst() is not working with foreign characters
by graff (Chancellor) on Jan 11, 2012 at 06:28 UTC
    I was able to test on 5.8.7 (an old freebsd box), as well as 5.10.1 (linux) and 5.12.3 (macosx), and all three behaved the same -- including a problem in the output from Test::More::is() (I don't know why unicode characters get converted to "?").

    Here's my "util.pm":

        use strict;
        package util;

        sub beautify {
            my ($in) = @_;
            my $tmp;
            foreach (split(/\s/o, lc($in))) {
                $tmp .= ucfirst($_);
                $tmp .= ' ';
            }
            $tmp =~ s/\s$//;
            return($tmp);
        }
        1;

    Here's my test code:

        use utf8;
        use util;
        use Test::More tests => 4;
        is( &util::beautify( "àisTheWord"), "Àistheword", "àisTheWord - special character changes case." );
        is( &util::beautify( "ùisTheWord"), "Ùistheword", "ùisTheWord - special character changes case." );
        is( &util::beautify( "üisTheWord"), "Üistheword", "üisTheWord - special character changes case." );
        is( &util::beautify( "ÿisTheWord"), "Ÿistheword", "ÿisTheWord - special character changes case." );
        print util::beautify( "ÿisTheWord") . "\n";

    And here's the output I got on all three versions:

        1..4
        ok 1 - ?isTheWord - special character changes case.
        ok 2 - ?isTheWord - special character changes case.
        ok 3 - ?isTheWord - special character changes case.
        ok 4 - ?isTheWord - special character changes case.
        Ÿistheword

    (I put in that last print statement just to check that it would correctly show what was expected on a utf8-capable terminal. I also left out the "u with circumflex" test because for some reason I couldn't post its upper-case form correctly -- very strange.)

    UPDATE: I should add that the command line I used was "perl -C31 test-util.t" (and it seems "-CS" would do the same thing.) Anyway, the tests do pass for me, despite the test message getting munged. So what are you doing that's different?
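
    As I read perlrun, -C31 is the numeric form of -CSD: -CS (value 7) puts UTF-8 layers on STDIN, STDOUT and STDERR, and D (value 24) additionally makes UTF-8 the default layer for handles opened later. For a script that only prints to the standard handles, the two should therefore behave alike. A rough in-script counterpart, offered as a sketch rather than an exact equivalent (it uses the stricter `:encoding(UTF-8)` layer, and `$word` is just an illustrative variable):

```perl
use strict;
use warnings;
use utf8;    # string literals in this file are UTF-8 text

# Approximate in-script counterpart of running with -CS:
# put a UTF-8 layer on each of the three standard handles.
binmode(STDIN,  ':encoding(UTF-8)');
binmode(STDOUT, ':encoding(UTF-8)');
binmode(STDERR, ':encoding(UTF-8)');

# Same operation beautify performs on a single word.
my $word = ucfirst(lc("ÿisTheWord"));

# Without an output layer this print would emit a "Wide character" warning,
# because U+0178 does not fit in a single byte.
print "$word\n";
```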

    SECOND UPDATE: I added use Test::More::UTF8 as per mje's suggestion below, and that fixed the output messages -- thanks++!!
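
    For installations where Test::More::UTF8 is not available, the Test::More documentation describes a manual alternative: set the encoding layer on the handles the test builder writes to. A minimal sketch, assuming the test file is saved as UTF-8 and calling ucfirst and lc directly instead of going through util.pm:

```perl
use strict;
use warnings;
use utf8;    # test names and expected strings are UTF-8 text
use Test::More tests => 1;

# Test::More prints through its own copies of the output handles,
# so the layers have to be set on those rather than on STDOUT/STDERR.
my $builder = Test::More->builder;
binmode $builder->output,         ':encoding(UTF-8)';
binmode $builder->failure_output, ':encoding(UTF-8)';
binmode $builder->todo_output,    ':encoding(UTF-8)';

# The test name should now be printed with "à" intact instead of "?".
is( ucfirst(lc("àisTheWord")), "Àistheword",
    "àisTheWord - special character changes case." );
```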

Re: UTF8 - ucfirst() is not working with foreign characters
by 3dbc (Monk) on Jan 11, 2012 at 04:07 UTC
Re: UTF8 - ucfirst() is not working with foreign characters
by Khen1950fx (Canon) on Jan 11, 2012 at 06:23 UTC
    This might solve the problem for you.
        #!/usr/bin/perl -slw
        use strict;
        use warnings;
        use Encode;
        use Test::utf8;
        use Test::More tests => 9;

        my $str1 = "\340isTheWord - special character changes case.";   # \340 is Latin-1 "à"
        my $str2 = "\300istheword - special character changes case.";   # \300 is Latin-1 "À"
        # Note: with a one-element list on the left, $str3 receives only $str1.
        my($str3)= ($str1, $str2, "$str1/$str2 - special character changes case" );

        # utf8::upgrade switches the internal representation to UTF-8
        # and returns the number of octets needed.
        my $num_octets1 = utf8::upgrade($str1);
        my $num_octets2 = utf8::upgrade($str2);
        my $num_octets3 = utf8::upgrade($str3);

        is_valid_string( $str1 );
        is_valid_string( $str2 );
        is_valid_string( $str3 );
        is_sane_utf8( $str1 );
        is_sane_utf8( $str2 );
        is_sane_utf8( $str3 );
        is_within_latin_1( $str1 );
        is_within_latin_1( $str2 );
        is_within_latin_1( $str3 );

        binmode STDOUT, ":encoding(UTF-8)";
        print beautify($str3);

        sub beautify {
            my ($in) = @_;
            my $tmp;
            # Note: a split limit of 1 returns the whole string as a single field,
            # so only the first character of the input is upper-cased here.
            foreach $_ ( split( /\s/o, lc($in), 1 ) ) {
                $tmp .= ucfirst;
                $tmp .= ' ';
            }
            $tmp =~ s/\s$//;
            return $tmp;
        }