shawnhcorey has asked for the wisdom of the Perl Monks concerning the following question:

I thought I knew about how to deal with UTF-8 characters but I can't seem to get things right. Program:
    #!/usr/bin/env perl

    use strict;
    use warnings;

    show( 'Resume' );
    show( 'Résumé' );

    sub show {
      my $s = shift @_;

      print "The string is: '$s'\n\t";
      for ( split //, $s ){
        printf "%02X ", ord( $_ );
      }
      print "\n\n";

      my $t = uc( $s );
      print "\tUppercase: '$t'\n";
      $t = lc( $t );
      print "\tLowercase: '$t'\n";
      printf "\tlength = %d\n", length( $s );
      {
        use bytes;
        printf "\tbytes = %d\n", length( $s );
      }
      print "\n";
    }

    __END__
Output:
    The string is: 'Resume'
        52 65 73 75 6D 65

        Uppercase: 'RESUME'
        Lowercase: 'resume'
        length = 6
        bytes = 6

    The string is: 'Résumé'
        52 C3 A9 73 75 6D C3 A9

        Uppercase: 'RéSUMé'
        Lowercase: 'résumé'
        length = 8
        bytes = 8
I needed to know the number of characters, the number of bytes, and how to convert to uppercase and lowercase. Version:
    $ perl -v

    This is perl, v5.10.0 built for ppc-linux

    Copyright 1987-2007, Larry Wall

    Perl may be copied only under the terms of either the Artistic License
    or the GNU General Public License, which may be found in the Perl 5 source kit.

    Complete documentation for Perl, including FAQ lists, should be found
    on this system using "man perl" or "perldoc perl". If you have access to
    the Internet, point your browser at http://www.perl.org/, the Perl Home Page.

Re: Help with Accented Characters
by ysth (Canon) on Feb 24, 2008 at 21:44 UTC
    Two things. First, if you have literal UTF-8 in your source and you want perl to recognize it as such, add use utf8; to the program. Second, once perl knows your résumé is UTF-8, it needs to be told that output to STDOUT may be UTF-8: binmode STDOUT, ":utf8"; (unless you want the output to be downgraded). The same can be done with the open pragma or the -C switch.
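A minimal sketch of the two steps in this reply, assuming the source file itself is saved as UTF-8. The variable names are illustrative only, and `:encoding(UTF-8)` is used here as a stricter alternative to the `:utf8` layer named above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;                            # string literals in this file are UTF-8 text

# Encode characters as UTF-8 on the way out of the program.
binmode STDOUT, ':encoding(UTF-8)';

my $s     = 'Résumé';
my $upper = uc $s;                   # case mapping now sees é as one character
my $lower = lc $upper;

printf "The string is: '%s'\n", $s;
printf "\tUppercase: '%s'\n", $upper;
printf "\tLowercase: '%s'\n", $lower;
printf "\tlength = %d\n", length $s; # counts characters, not bytes
```

With both lines in place, the length should come out as 6 and the accented letters should change case along with the rest of the word.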
      I thought Perl 5.8+ automatically recognized UTF-8. I thought `use utf8;` was only used in Perl 5.6+. I'm not sure this is right. I added it to my program and got:
      The string is: 'Resume'
          52 65 73 75 6D 65

          Uppercase: 'RESUME'
          Lowercase: 'resume'
          length = 6
          bytes = 6

      The string is: 'R�sum
          52 E9 73 75 6D E9

          Uppercase: 'R�SUM�'
          Lowercase: 'r�sum
          length = 6
          bytes = 8
      Note that the UTF-8 character C3 A9 got converted to E9. This is not what I want.
        No, you need utf8 to have perl understand literal utf8 in your source, even in 5.8. (I think you can use a BOM instead, but I don't know for sure.)

        The character is U+00E9. Perl may choose to store it internally as C3 A9 (flagged as utf8) or just E9 (not flagged as utf8, and assuming appropriate locale). Can you clarify what you mean by "This is not what I want."? Perl certainly didn't output that "65533" that you show. May I suggest you change your test to show you what is in various variables using Data::Dumper (preferably with $Data::Dumper::Useqq=1)?
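A small sketch of the Data::Dumper suggestion, under the assumption that it is useful to compare the decoded characters with the raw UTF-8 bytes side by side. The variable names are hypothetical, and the string is built from explicit byte escapes so the result does not depend on how the source file is encoded.

```perl
use strict;
use warnings;
use Data::Dumper;
use Encode qw(decode);

$Data::Dumper::Useqq = 1;   # escape non-printable and non-ASCII data
$Data::Dumper::Terse = 1;   # print the bare value without "$VAR1 = "

my $bytes = "R\xC3\xA9sum\xC3\xA9";     # eight UTF-8 encoded bytes
my $chars = decode('UTF-8', $bytes);    # six decoded characters

my $dump_bytes = Dumper($bytes);
my $dump_chars = Dumper($chars);

# Both dumps are plain ASCII, so they survive any terminal or web page.
print "bytes: $dump_bytes";
print "chars: $dump_chars";
```

Because the dumps contain only escapes, they show what perl actually holds in each variable, independent of how the terminal renders é.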

        To get a proper hexdump, use unpack instead of ord:

        print join " ", unpack("(H2)*", $s);

        When making that change to your code (in addition to use utf8; — as ysth correctly pointed out already), I'm getting output as I would expect (presuming the source file has been composed with a UTF8 editor).

        $ ./669879.pl
        The string is: 'Resume'
            52 65 73 75 6d 65

            Uppercase: 'RESUME'
            Lowercase: 'resume'
            length = 6
            bytes = 6

        The string is: 'Résumé'
            52 c3 a9 73 75 6d c3 a9

            Uppercase: 'RÉSUMÉ'
            Lowercase: 'résumé'
            length = 6
            bytes = 8

        (I've converted the é/É chars in the output to ISO Latin-1 so that the PM web frontend displays them properly. But as the hexdump shows, they're internally encoded as c3 a9, i.e. UTF-8.)
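The two kinds of hexdump discussed in this reply can be sketched side by side. This is an illustration, not the poster's code: both helper names are hypothetical, and because unpack may see either characters or internal bytes depending on the perl version, the byte dump here encodes to UTF-8 explicitly first.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: one hex value per character (code point).
sub codepoint_hex {
    my ($text) = @_;
    return join ' ', map { sprintf '%02X', ord $_ } split //, $text;
}

# Hypothetical helper: one hex pair per byte of the UTF-8 encoding.
sub utf8_byte_hex {
    my ($text) = @_;
    return join ' ', unpack '(H2)*', encode('UTF-8', $text);
}

# Decode explicit bytes so the example does not depend on the file encoding.
my $s = decode('UTF-8', "R\xC3\xA9sum\xC3\xA9");

print codepoint_hex($s), "\n";   # 52 E9 73 75 6D E9
print utf8_byte_hex($s), "\n";   # 52 c3 a9 73 75 6d c3 a9
```

Seeing E9 in the first line and c3 a9 in the second is expected: they are two views of the same string, not a conversion from one to the other.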

        First, what is wrong with the original output? I see that the accented characters did not get their case converted. Is that the only problem? I see the accented letters in your output just fine, so I think that UTF-8 input and output is working OK.

        The Unicode character is code point U+00E9, "Latin small letter e with acute". That is an integer, in the abstract mathematical sense. The character is E9. If you split the string into characters and print the ordinal of each one, E9 is what you will get.

        When you encode the character as a sequence of bytes using UTF-8, the character U+00E9 will be encoded as two bytes, C3 A9. But Perl hides this from you. If the string is holding characters (as opposed to holding bytes) the implementation details will include the fact that those two bytes are in memory, but splitting into characters will include both bytes in one such character, and ord will know how to turn that into an integer.
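A short sketch of the distinction drawn above between a character and its UTF-8 encoding. The helper name `char_and_byte_counts` is hypothetical, and counting bytes by encoding explicitly is a substitute for the `use bytes` approach in the original program, not what the poster wrote.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $e_acute = chr 0xE9;                     # the character U+00E9
my $encoded = encode('UTF-8', $e_acute);    # its UTF-8 byte sequence, C3 A9

printf "code point:  U+%04X\n", ord $e_acute;
printf "UTF-8 bytes: %d\n", length $encoded;

# Hypothetical helper: character count and UTF-8 byte count of a text string.
sub char_and_byte_counts {
    my ($text) = @_;
    return ( length($text), length( encode('UTF-8', $text) ) );
}

my ($chars, $bytes) =
    char_and_byte_counts( decode('UTF-8', "R\xC3\xA9sum\xC3\xA9") );
print "characters = $chars, bytes = $bytes\n";
```

This answers the "number of characters" and "number of bytes" parts of the question without relying on how perl happens to store the string internally.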

        Actually, the Perl docs confuse the meaning of character and code point. The above doesn't consider that a single grapheme might be composed of several code points, such as a base letter A followed by a modifier "acute accent above".

        Now your new output: &#65533; is the HTML encoding for the "Replacement Character" (U+FFFD, �), normally shown as a diamond with a question mark inside. This means that with UTF-8 enabled, which turned on the Unicode version of uc and lc, it did not know how to convert é so used this as the error replacement. I don't know why you are missing the final character in two of the lines; perhaps a cut and paste problem?
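To illustrate the point that one grapheme can span several code points, here is a hedged sketch using the `\X` regex escape and the core Unicode::Normalize module. The helper `grapheme_count` is a hypothetical name, and the example uses "e" plus a combining acute accent rather than the letter A mentioned above.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Hypothetical helper: count grapheme clusters (user-perceived characters).
sub grapheme_count {
    my ($text) = @_;
    my $count = () = $text =~ /\X/g;   # \X matches one grapheme cluster
    return $count;
}

my $decomposed = "e\x{301}";           # "e" followed by COMBINING ACUTE ACCENT
my $composed   = NFC($decomposed);     # canonical composition gives U+00E9

printf "code points: %d, graphemes: %d\n",
    length($decomposed), grapheme_count($decomposed);
printf "after NFC:   %d code point, U+%04X\n",
    length($composed), ord($composed);
```

Normalizing to NFC before calling length is one way to make the count match what a reader sees, though it only helps where a precomposed character exists.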

        Capitalization, in general, is language specific. I agree that a generic routine should convert é to É. Only if it knows you are writing French, where capital letters don't have their marks shown, would it map é to E. I don't know enough about the implementation to tell you why the function failed when use utf8 was used.

        —John

Re: Help with Accented Characters
by Anonymous Monk on Feb 26, 2008 at 18:29 UTC
    Shawn,

    Check your manners! You, to my mind, have insulted the very helpful, thoughtful, and even patient people who offered responses to a question you could just as well have figured out yourself.

    But, no. You post here without thanks or grace, and you expect others not only to clarify but to divine your intent and then produce your perfect answer.

    Get some manners.