tbusch has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I'm using perl 5.8.6 and the following program
#!/usr/bin/perl
use strict;

my $string = "cl\xe9ment";
utf8::upgrade($string);
if (utf8::is_utf8($string)) { print "is utf8\n"; }
if (utf8::valid($string)) { print "is valid utf8\n"; }
if ($string =~ m/\xe9/) { print "match \\xE9\n"; }
if ($string =~ m/\x{c3a9}/) { print "match \\xC3A9\n"; }
yields
is utf8
is valid utf8
match \xE9
instead of
is utf8
is valid utf8
match \xC3A9
Is this a bug ? Why is the latin e letter with acute not getting upgraded to UTF-8 ?

Replies are listed 'Best First'.
Re: utf8::upgrade weirdness
by ysth (Canon) on Aug 08, 2006 at 17:55 UTC
    Note that utf8::valid is an internal method, and shouldn't be needed or useful in production code.

    \x{c3a9} is not a valid unicode codepoint; I think you meant \xc3\xa9. But even that won't match, because perl still treats the string as a sequence of characters, the third of which is the unicode code point 00E9. If you want to create a string where each character is a byte of a utf8-encoded string, you want to be using Encode, not the utf8 functions:

    use Encode;
    $string = encode("utf8", $string);
    This should do exactly the same thing whether you've done utf8::upgrade($string) or not.
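A minimal sketch of that point, assuming only the core Encode module (the variable names are made up for illustration): the same text is encoded once from a plain string and once from an upgraded one, and the resulting byte strings are compared.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $plain    = "cl\xe9ment";    # 7 characters, not upgraded
my $upgraded = "cl\xe9ment";
utf8::upgrade($upgraded);       # same 7 characters, different internal storage

# encode() returns a byte string: one character per utf8-encoded byte
my $bytes_plain    = encode("utf8", $plain);
my $bytes_upgraded = encode("utf8", $upgraded);

print "same bytes either way\n" if $bytes_plain eq $bytes_upgraded;
print "length: ", length($bytes_plain), "\n";              # 8
print "match \\xc3\\xa9\n" if $bytes_plain =~ /\xc3\xa9/;
```

Only after encode() does a pattern like /\xc3\xa9/ have two separate characters to match against.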
      Actually, "\x{c3a9}" is a valid code point. You can look it up.
        Oops, I just looked in perl's unicore/UnicodeData.txt for an exact match, but that only has
        AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
        D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
        Thanks for the correction.
Re: utf8::upgrade weirdness
by graff (Chancellor) on Aug 09, 2006 at 03:11 UTC
    In Perl 5.8.x, a string literal like "\xe9" has a slightly ambiguous nature -- it can end up as either a single byte in a non-unicode context, or as a two-byte utf8 character in a unicode context. I think this is intended as a "transitional" behavior, to make some things easier for folks who were habituated to iso-8859-1.

    It so happens that the two byte utf8 value for "\xe9" (a.k.a. é) turns out to be 0xC3 0xA9 -- but don't confuse that with "\x{c3a9}", which represents a completely different unicode code point (U+C3A9, one of the Hangul syllables).
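To make the difference between those literals concrete, here is a small sketch (the labels and variable names are illustrative only) that prints the length and the code points of each one:

```perl
use strict;
use warnings;

my $one_char  = "\x{c3a9}";    # a single character, code point U+C3A9
my $two_chars = "\xc3\xa9";    # two characters, U+00C3 followed by U+00A9
my $e_acute   = "\xe9";        # a single character, U+00E9

for my $pair (['\x{c3a9}', $one_char], ['\xc3\xa9', $two_chars], ['\xe9', $e_acute]) {
    my ($label, $str) = @$pair;
    # List every character of the string as a U+XXXX code point
    my @points = map { sprintf "U+%04X", ord } split //, $str;
    printf "%-10s length=%d  %s\n", $label, length($str), "@points";
}
```

None of the three strings is equal to either of the others, which is why the pattern in the original program has nothing to match.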

    If you read enough of perlunicode to understand how utf8 works (look for the section titled "Unicode Encodings"), you can figure out why the unicode code point U+00E9 (expressible in perl 5.8 as just "\xe9") turns out to be the two-byte binary sequence 0xC3 0xA9 when it's encoded as utf8 -- but hex-numeric literals in strings and regexes are supposed to express code points, not the bytes of their utf8 encoding. Note the following:

    perl -e '$x="\xe9"; $y="\x{00e9}"; print "\\xe9 eq \\x00e9\n" if ($x eq $y)'
    # output is: \xe9 eq \x00e9
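For anyone who wants to see where 0xC3 0xA9 comes from, the two-byte utf8 form can be worked out by hand. This is only a sketch: two_byte_utf8 is a hypothetical helper written for this example, and it covers just the range U+0080..U+07FF.

```perl
use strict;
use warnings;

# Hypothetical helper: hand-encode a code point in the range U+0080..U+07FF.
sub two_byte_utf8 {
    my ($cp) = @_;
    die "only handles U+0080..U+07FF\n" if $cp < 0x80 || $cp > 0x7FF;
    my $lead  = 0xC0 | ($cp >> 6);      # 110xxxxx carries the top 5 bits
    my $trail = 0x80 | ($cp & 0x3F);    # 10xxxxxx carries the low 6 bits
    return ($lead, $trail);
}

my @bytes = two_byte_utf8(0xE9);
printf "U+00E9 -> %s\n", join " ", map { sprintf "0x%02X", $_ } @bytes;
# prints: U+00E9 -> 0xC3 0xA9
```

In real code you would let Encode do this; the arithmetic is here only to show that the bytes are derived from the code point and are not the code point itself.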

    update: To give a direct answer to your question:

    Why is the latin e letter with acute not getting upgraded to UTF-8 ?
    Actually, the letter is being upgraded to utf8; you were just comparing it to the wrong literal value.
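One way to check this, as a sketch that assumes the core bytes module is acceptable for peeking at the internal storage (that is debugging use only, not something for production code):

```perl
use strict;
use warnings;
use bytes ();    # loads bytes::length without switching on byte semantics

my $before = "cl\xe9ment";
my $after  = $before;
utf8::upgrade($after);

# As far as character semantics go, nothing has changed
print "still equal\n" if $before eq $after;
print "chars: ", length($before), " vs ", length($after), "\n";    # 7 vs 7
print "match \\xe9\n" if $after =~ /\xe9/;

# Only the internal storage differs: é now occupies two bytes
print "internal bytes: ", bytes::length($before), " vs ", bytes::length($after), "\n";    # 7 vs 8
```

So the upgrade did happen, but it is invisible to eq, length and regexes, which all keep working on characters.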

    And in case you are trying to print the value '\xe9' to a file handle as utf8 data, you must first set the file handle to utf8 mode -- e.g.:

    perl -e 'binmode STDOUT, ":utf8"; print "\xe9"' | xxd
    # output is: 0000000: c3a9  ..
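The same effect can be seen without xxd by writing to in-memory file handles. A sketch, assuming a perl with PerlIO layers (5.8 or later); the handle and buffer names are illustrative:

```perl
use strict;
use warnings;

my $char = "\xe9";

# Write the same character through a raw handle and through a :utf8 handle
my ($raw, $utf8) = ('', '');
open my $raw_fh,  '>:raw',  \$raw  or die "cannot open raw buffer: $!";
open my $utf8_fh, '>:utf8', \$utf8 or die "cannot open utf8 buffer: $!";
print {$raw_fh}  $char;
print {$utf8_fh} $char;
close $raw_fh;
close $utf8_fh;

printf "raw:  %s\n", unpack "H*", $raw;     # e9
printf "utf8: %s\n", unpack "H*", $utf8;    # c3a9
```

The layer on the handle decides which bytes are written, not whether the string was upgraded beforehand.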