Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Problem upper-casing characters above 0xFF. The following code snippet only changes every other character:
    my $aaa = "aáeéiíoóuúoöuü";
    my $bbb = uc $aaa;
    print "$aaa\n";
    print "$bbb\n";
Any suggestions? Do I finally need to join battle with Unicode?

Replies are listed 'Best First'.
Re: Upper-casing characters above Hex FF.
by ysth (Canon) on Jan 12, 2004 at 20:39 UTC
    I think what you have there is characters above 0x7f. The difference is critical, as perl will handle characters above 0xff (if utf8 encoded) without further intervention.

    To have your locale settings determine how these characters are uppercased (and lowercased, sorted, etc.), try use locale;.

    Otherwise, if you know they are latin1, you can upgrade to utf8 with

        utf8::upgrade($aaa);
        $bbb = uc $aaa;
        print "$bbb\n";

    which will output utf8-encoded data. If you don't want that, add utf8::downgrade($bbb); before the print (but see the doc before using utf8::downgrade).
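A minimal sketch of the upgrade-then-uc approach, assuming the data really holds Latin-1 code points. The characters are written as \x escapes so the result does not depend on how the script file is saved, and code points are printed in hex so the output stays plain ASCII. The variable $before is illustrative only and not part of the original snippet.

```perl
use strict;
use warnings;

# "aáeé" written as Latin-1 code points via escapes.
my $aaa = "a\xE1e\xE9";

# Without an upgrade (and without 'use locale' or the unicode_strings
# feature), uc only changes the ASCII letters.
my $before = uc $aaa;

# Switch the internal representation to utf8. The characters themselves
# are unchanged, but uc now applies Unicode case rules to them.
utf8::upgrade($aaa);
my $bbb = uc $aaa;

# Print code points rather than raw characters.
print join(' ', map { sprintf 'U+%04X', ord } split //, $before), "\n";
print join(' ', map { sprintf 'U+%04X', ord } split //, $bbb), "\n";
```

The first line should still show U+00E1 and U+00E9 (unchanged), while the second shows U+00C1 and U+00C9.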

    If you know they are in some other encoding, specify that with use encoding "whatever"; and then use utf8::*grade as above.

    (Also, note that at least one utf8 character (chr(223), "LATIN SMALL LETTER SHARP S") will produce two characters ("SS") when uppercased.)
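The sharp s case can be checked with a short sketch like the following, assuming the string has been upgraded so that uc follows Unicode rules. The variable names are illustrative.

```perl
use strict;
use warnings;

my $sharp_s = chr(223);       # LATIN SMALL LETTER SHARP S
utf8::upgrade($sharp_s);      # make uc apply Unicode case rules
my $upper = uc $sharp_s;

# One character goes in, two come out.
printf "%d character in, %d out: %s\n",
    length($sharp_s), length($upper), $upper;
```

Code that assumes uc preserves string length (for example when uppercasing into fixed-width fields) would need to allow for this.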

Re: Upper-casing characters above Hex FF.
by hardburn (Abbot) on Jan 12, 2004 at 19:51 UTC

    What version of perl are you using? 5.8.0 and beyond should be able to automatically detect Unicode and do the right thing.


Re: Upper-casing characters above Hex FF.
by Not_a_Number (Prior) on Jan 12, 2004 at 20:07 UTC

    For your sample, you could do this:

    my $aaa = "aáeéiíoóuúoöuü";
    open my $fh, '>', 'tmp.txt' or die "Can't open file $!";
    my @aaa = split //, $aaa;
    print $fh chr( ord($_) - 32 ) for @aaa;

    Note that I only print to a file because I'm on Windows, which uses its own character encoding in the CMD window...

    But this solution won't always work (try 'ÿ' for example...)
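To see why 'ÿ' breaks the subtract-32 trick, here is a small sketch assuming Latin-1 code points. The helper minus_32 is hypothetical and only reproduces the arithmetic from the snippet above.

```perl
use strict;
use warnings;

# Hypothetical helper reproducing the "subtract 32" idea.
sub minus_32 { return chr( ord($_[0]) - 32 ) }

my $from_a_acute = minus_32("\xE1");   # U+00E1 -> U+00C1, the right answer
my $from_y_diaer = minus_32("\xFF");   # U+00FF -> U+00DF, which is sharp s

# The real uppercase of U+00FF lies outside Latin-1 altogether.
my $y = "\xFF";
utf8::upgrade($y);
my $real_upper = uc $y;                # U+0178

printf "U+%04X U+%04X U+%04X\n",
    ord($from_a_acute), ord($from_y_diaer), ord($real_upper);
```

The arithmetic also mangles anything that is not a lowercase letter (digits, spaces, already-uppercase characters), since it subtracts 32 unconditionally.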

    So, Unicode is your friend ;-)

    dave

      That won't work too well.
      use open IN  => ":crlf", OUT => ":bytes";
      use open OUT => ':utf8';
      use open IO  => ":encoding(iso-8859-7)";
      
      use open IO  => ':locale';
      
      use open ':utf8';
      use open ':locale';
      use open ':encoding(iso-8859-7)';
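As a rough illustration of what one of these pragma lines does, the sketch below sets a default output layer and then inspects the bytes that land in a file. It assumes UTF-8 output is what is wanted; the temporary file and all variable names are illustrative only.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Default layer for handles opened for output in this lexical scope.
use open OUT => ':encoding(UTF-8)';

my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

my $text = "a\xE1e\xE9";     # "aáeé" as Latin-1 code points
utf8::upgrade($text);        # so uc applies Unicode case rules

# No layer is named here, so the pragma's default applies.
open my $out, '>', $path or die "Can't open $path: $!";
print {$out} uc $text;
close $out or die "Can't close $path: $!";

# Read the file back as raw bytes to see what was written.
open my $in, '<:raw', $path or die "Can't open $path: $!";
my $bytes = do { local $/; <$in> };
close $in;

print join(' ', map { sprintf '%02X', ord } split //, $bytes), "\n";
```

Each accented capital should appear as a two-byte UTF-8 sequence in the file, so six bytes are written for four characters.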
      
Re: Upper-casing characters above Hex FF (from originator).
by Anonymous Monk on Jan 13, 2004 at 16:07 UTC
    From Originator:
    First, thanks for the replies.
    Second, yes I did mean above HEX 7F, and not HEX FF.
    Third, I should have mentioned I am using perl v5.8.0, running on Windows 98. Perl runs in DOS mode, so there is an additional level of obscurity in that DOS and Windows have different character sets (Code Pages 850 and 1252 respectively), but I don't think this is my problem.
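If the code page mismatch does turn out to matter, one possible approach is to decode and encode explicitly with the core Encode module. This is only a sketch, and it assumes the input bytes are in Code Page 1252 and the console expects Code Page 850; neither is confirmed by the thread, and the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "aáeé" as a Windows editor would store it in Code Page 1252 (assumed).
my $cp1252_bytes = "a\xE1e\xE9";

my $chars       = decode('cp1252', $cp1252_bytes);  # bytes -> characters
my $upper       = uc $chars;                        # Unicode-aware uc
my $cp850_bytes = encode('cp850', $upper);          # characters -> DOS bytes

# Show the resulting byte values in hex.
print join(' ', map { sprintf '%02X', ord } split //, $cp850_bytes), "\n";
```

Being explicit at both ends means uc only ever sees characters, and the choice of code page is confined to the decode and encode calls.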
    Fourth, Windows doesn't have locale as far as I can see, so no solution there.
    Finally, utf8::upgrade seems to do the trick, but I am confused about how utf8::downgrade knows what to downgrade to.
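On the last point: as far as the utf8 documentation describes it, utf8::downgrade does not pick a target encoding. It converts the string's internal representation back to the platform's native 8-bit form (Latin-1 on non-EBCDIC systems), one byte per character, and that only succeeds when every character is 0xFF or below. A small sketch, with illustrative variable names, that passes a true second argument so a failure returns false instead of dying:

```perl
use strict;
use warnings;

my $e = "\xE9";               # e with acute
utf8::upgrade($e);
my $E = uc $e;                # U+00C9, still fits in 8 bits
my $ok_E = utf8::downgrade($E, 1);

my $y = "\xFF";               # y with diaeresis
utf8::upgrade($y);
my $Y = uc $y;                # U+0178, does not fit in 8 bits
my $ok_Y = utf8::downgrade($Y, 1);

printf "U+%04X downgraded: %s\n", ord($E), $ok_E ? 'yes' : 'no';
printf "U+%04X downgraded: %s\n", ord($Y), $ok_Y ? 'yes' : 'no';
```

The first string downgrades cleanly; the second cannot, and is left in its utf8 form.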