UTF8 - ucfirst() is not working with foreign characters

tyatpi has asked for the wisdom of the Perl Monks concerning the following question:

Good evening Monks, I have wrestled with the following subroutine but I cannot understand why the ucfirst function does not work. In a file called util.pm, there is a subroutine:

sub beautify {
    my ($in) = @_;
    my $tmp;
    foreach (split(/\s/o, lc($in))){
        $tmp .= ucfirst($_);
        $tmp .= ' ';
    }
    $tmp =~ s/\s$//;
    return($tmp);
}

And now I test it with:

use util;

use Test::More tests => 34;

is( &util::beautify( "àisTheWord"), "Àistheword", "àisTheWord - special character changes case." );
...
is( &util::beautify( "ùisTheWord"), "Ùistheword", "ùisTheWord - special character changes case." );
is( &util::beautify( "ûisTheWord"), "Ûistheword", "ûisTheWord - special character changes case." );
is( &util::beautify( "üisTheWord"), "Üistheword", "üisTheWord - special character changes case." );
is( &util::beautify( "ÿisTheWord"), "Ÿistheword", "ÿisTheWord - special character changes case." );
...

They are all failing. I've tried with:

use utf8;
use Encode;

and all subsets of that pair, but to no avail. The system is perl v5.8.8, and when I run the tests on my v5.10 home laptop without any modifications to the original util.pm file, all tests pass! Any insights on how to debug would be very very helpful.


Replies are listed 'Best First'.
Re: UTF8 - ucfirst() is not working with foreign characters by graff (Chancellor) on Jan 11, 2012 at 06:28 UTC
I was able to test on 5.8.7 (an old freebsd box), as well as 5.10.1 (linux) and 5.12.3 (macosx), and all three behaved the same -- including a problem in the output from `Test::More::is()` (I don't know why unicode characters get converted to "?"). Here's my "util.pm":

use strict;
package util;

sub beautify {
    my ($in) = @_;
    my $tmp;
    foreach (split(/\s/o, lc($in))){
        $tmp .= ucfirst($_);
        $tmp .= ' ';
    }
    $tmp =~ s/\s$//;
    return($tmp);
}
1;

Here's my test code:

use utf8;
use util;
use Test::More tests => 4;

is( &util::beautify( "àisTheWord"), "Àistheword", "àisTheWord - special character changes case." );
is( &util::beautify( "ùisTheWord"), "Ùistheword", "ùisTheWord - special character changes case." );
is( &util::beautify( "üisTheWord"), "Üistheword", "üisTheWord - special character changes case." );
is( &util::beautify( "ÿisTheWord"), "Ÿistheword", "ÿisTheWord - special character changes case." );

print util::beautify( "ÿisTheWord") . "\n";

And here's the output I got on all three versions:

1..4
ok 1 - ?isTheWord - special character changes case.
ok 2 - ?isTheWord - special character changes case.
ok 3 - ?isTheWord - special character changes case.
ok 4 - ?isTheWord - special character changes case.
Ÿistheword

(I put in that last print statement just to check that it would correctly show what was expected on a utf8-capable terminal. I also left out the "u with circumflex" because for some reason I couldn't post its upper-case form correctly -- very strange.)

UPDATE: I should add that the command line I used was "perl -C31 test-util.t" (and it seems "-CS" would do the same thing.) Anyway, the tests do pass for me, despite the test message getting munged. So what are you doing that's different?

SECOND UPDATE: I added `use Test::More::UTF8` as per mje's suggestion below, and that fixed the output messages -- thanks++!!
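
One way to see why `use utf8;` in the test file matters: without it, perl treats a literal such as "àisTheWord" in a UTF-8 encoded file as a sequence of raw bytes, so the string starts with the byte 0xC3 rather than the character à. The sketch below is not from the thread. It writes the UTF-8 bytes out explicitly (so the file itself is plain ASCII) and uses `Encode::decode` to compare the two cases. It assumes that neither `use locale` nor the `unicode_strings` feature is in effect.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 encoding of "àisTheWord", spelled out as explicit bytes so that
# this file does not depend on the editor's encoding.
my $bytes = "\xC3\xA0isTheWord";

# Undecoded: the string starts with the byte 0xC3, and ucfirst/lc only
# touch ASCII letters in a byte string, so the first "letter" is unchanged.
my $from_bytes = ucfirst(lc($bytes));

# Decoded: the string starts with the single character U+00E0 (à),
# which ucfirst maps to U+00C0 (À).
my $chars      = decode('UTF-8', $bytes);
my $from_chars = ucfirst(lc($chars));

printf "bytes: length %d, first byte 0x%02X\n", length($from_bytes), ord($from_bytes);
printf "chars: length %d, first code point U+%04X\n", length($from_chars), ord($from_chars);
```

If the first line reports 0xC3 and the second U+00C0, the input to `beautify` is probably arriving as undecoded bytes in the failing setup, which is worth checking before blaming `ucfirst` itself.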
Re^2: UTF8 - ucfirst() is not working with foreign characters by mje (Curate) on Jan 11, 2012 at 09:25 UTC
Test::More::UTF8 will fix the test message so long as your terminal is UTF8.
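
For anyone who cannot install Test::More::UTF8, a similar effect can probably be had with core modules only, by putting an encoding layer on the handles that Test::More writes to. This is a minimal sketch along the lines of the Test::More documentation, not something posted in the thread. Whether Test::More::UTF8 does exactly this internally is an assumption on my part. The file must be saved as UTF-8 for `use utf8;` to be correct.

```perl
use strict;
use warnings;
use utf8;                       # string literals in this file are UTF-8
use Test::More tests => 1;

# Put a UTF-8 encoding layer on each handle Test::More prints to,
# so test names containing non-ASCII characters are not munged.
my $builder = Test::More->builder;
binmode $builder->output,         ':encoding(UTF-8)';
binmode $builder->failure_output, ':encoding(UTF-8)';
binmode $builder->todo_output,    ':encoding(UTF-8)';

is( ucfirst('àistheword'), 'Àistheword', 'àistheword: first letter is upper-cased' );
```

As with the module, the test name only displays correctly if the terminal itself expects UTF-8.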
Re: UTF8 - ucfirst() is not working with foreign characters by 3dbc (Monk) on Jan 11, 2012 at 04:07 UTC
Please consult uc/lc with extended ASCII
Re: UTF8 - ucfirst() is not working with foreign characters by Khen1950fx (Canon) on Jan 11, 2012 at 06:23 UTC
This might solve the problem for you.

#!/usr/bin/perl -slw
use strict;
use warnings;
use Encode;
use Test::utf8;
use Test::More tests => 9;

my $str1 = "\340isTheWord - special character changes case.";
my $str2 = "\300istheword - special character changes case.";
# Note: this is a list assignment to a one-element list, so $str3 gets $str1.
my($str3)= ($str1, $str2, "$str1/$str2 - special character changes case" );

# utf8::upgrade returns the number of octets needed to represent the string as UTF-8.
my $num_octets1 = utf8::upgrade($str1);
my $num_octets2 = utf8::upgrade($str2);
my $num_octets3 = utf8::upgrade($str3);

is_valid_string( $str1 );
is_valid_string( $str2 );
is_valid_string( $str3 );

is_sane_utf8( $str1 );
is_sane_utf8( $str2 );
is_sane_utf8( $str3 );

is_within_latin_1( $str1 );
is_within_latin_1( $str2 );
is_within_latin_1( $str3 );

binmode STDOUT, ":encoding(UTF-8)";
print beautify($str3);

sub beautify {
    my ($in) = @_;
    my $tmp;
    # Note: a split limit of 1 returns the whole string as a single field,
    # so only the first letter of the entire string is upper-cased.
    foreach $_ ( split( /\s/o, lc($in), 1 ) ) {
        $tmp .= ucfirst;
        $tmp .= ' ';
    }
    $tmp =~ s/\s$//;
    return $tmp;
}
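
To isolate what `utf8::upgrade` contributes here, the following sketch (not from the thread) runs the same input through a `beautify` twice: once as a native one-byte-per-character string and once after upgrading. The `beautify` below is a hypothetical variant that splits on whitespace with no limit, so every word is capitalized. The sample phrase is made up for illustration. It assumes neither `use locale` nor the `unicode_strings` feature is in effect.

```perl
use strict;
use warnings;

# Hypothetical variant of beautify(): lower-case everything, then
# upper-case the first letter of every whitespace-separated word.
sub beautify {
    my ($in) = @_;
    return join ' ', map { ucfirst($_) } split ' ', lc $in;
}

# "\340" is the single byte 0xE0 (à in Latin-1).
my $native = "\340 la carte";
my $before = beautify($native);    # byte semantics: 0xE0 is left alone

# Same characters, but stored internally as UTF-8.
my $upgraded = $native;
utf8::upgrade($upgraded);
my $after = beautify($upgraded);   # character semantics: 0xE0 becomes 0xC0

binmode STDOUT, ':encoding(UTF-8)';
printf "before upgrade: first code point 0x%02X (%s)\n", ord($before), $before;
printf "after upgrade:  first code point 0x%02X (%s)\n", ord($after),  $after;
```

The two strings compare equal before the call to `beautify`; only the internal representation differs, and that is what decides whether `lc` and `ucfirst` apply Unicode case rules to characters above 0x7F.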