utf8::upgrade weirdness

tbusch has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I'm using perl 5.8.6 and the following program:

#!/usr/bin/perl

use strict ;

my $string = "cl\xe9ment";

utf8::upgrade($string);

if (utf8::is_utf8($string)) {
  print "is utf8\n";
}

if (utf8::valid($string)) {
  print "is valid utf8\n";
}

if ($string =~ m/\xe9/) {
  print "match \\xE9\n";
}

if ($string =~ m/\x{c3a9}/) {
  print "match \\xC3A9\n";
}

yields

is utf8
is valid utf8
match \xE9

instead of

is utf8
is valid utf8
match \xC3A9

Is this a bug? Why is the Latin e letter with acute not getting upgraded to UTF-8?


Re: utf8::upgrade weirdness by ysth (Canon) on Aug 08, 2006 at 17:55 UTC
Note that utf8::valid is an internal function, and shouldn't be needed or useful in production code. \x{c3a9} is not a valid unicode codepoint; I think you meant \xc3\xa9. But even that won't match, because perl still treats the string as a sequence of characters, the third of which is the unicode code point 00E9. If you want to create a string where each character is a byte of a utf8-encoded string, you want to be using Encode, not the utf8 functions: `$string = encode("utf8", $string);` This should do exactly the same thing whether you've done utf8::upgrade($string) or not.
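To make the distinction above concrete, here is a minimal sketch that compares the upgraded character string with the result of encode("utf8", ...). The helper name char_ords is hypothetical and exists only for this illustration; the rest uses the core Encode module and the built-in utf8:: functions.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: list each character's ordinal in hex.
sub char_ords {
    my ($s) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $s;
}

my $chars = "cl\xe9ment";    # 7 characters; the third is U+00E9
utf8::upgrade($chars);       # changes the internal storage, not the characters

# Each character of $bytes is one byte of the utf8 encoding of $chars.
my $bytes = encode("utf8", $chars);

print length($chars), ": ", char_ords($chars), "\n";
print length($bytes), ": ", char_ords($bytes), "\n";
print "upgraded string does not match \\xc3\\xa9\n" if $chars !~ /\xc3\xa9/;
print "encoded string matches \\xc3\\xa9\n"         if $bytes =~ /\xc3\xa9/;
```

The upgraded string still has 7 characters with E9 in third position, while the encoded string has 8, with C3 A9 in place of E9. Only the encoded string matches /\xc3\xa9/.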
Re^2: utf8::upgrade weirdness by graff (Chancellor) on Aug 09, 2006 at 03:24 UTC
Actually, "\x{c3a9}" is a valid code point. You can look it up.
Re^3: utf8::upgrade weirdness by ysth (Canon) on Aug 09, 2006 at 17:32 UTC
Oops, I just looked in perl's unicore/UnicodeData.txt for an exact match, but that only has

AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;

Thanks for the correction.
Re: utf8::upgrade weirdness by graff (Chancellor) on Aug 09, 2006 at 03:11 UTC
In Perl 5.8.x, a "string" literal like `\xe9` has a slightly ambiguous nature: it can end up as either a single byte in a non-unicode context, or as a two-byte utf8 character in a unicode context. I think this is intended as a "transitional" behavior, to make some things easier for folks who were habituated to iso-8859-1.

It so happens that the two-byte utf8 value for "\xe9" (a.k.a. é) turns out to be 0xC3 0xA9, but don't confuse that with "\x{c3a9}", which represents a completely different unicode code point (U+C3A9, a Hangul syllable). If you read enough of perlunicode to understand how utf8 works (look for the section titled "Unicode Encodings"), you can figure out why the unicode code point U+00E9 (expressible in perl 5.8 as just "\xe9") turns out to be the two-byte binary sequence 0xC3 0xA9 when it's encoded as utf8. But hex-numeric literals in strings and regexes express code points, not the bytes of their utf8 encoding. Note the following:

perl -e '$x="\xe9"; $y="\x{00e9}"; print "\\xe9 eq \\x00e9\n" if ($x eq $y)'
# output is: \xe9 eq \x00e9

update: To give a direct answer to your question, "Why is the Latin e letter with acute not getting upgraded to UTF-8?": actually, the letter is being upgraded to utf8; you were just comparing it to the wrong literal value. And in case you are trying to print the value '\xe9' to a file handle as utf8 data, you must first set the file handle to utf8 mode, e.g.:

perl -e 'binmode STDOUT, ":utf8"; print "\xe9"' | xxd
# output is: 0000000: c3a9 ..
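For anyone who wants to see why U+00E9 becomes 0xC3 0xA9 without reading all of perlunicode, here is a rough sketch of the utf8 bit layout for code points up to U+FFFF. The names utf8_bytes and hex_bytes are hypothetical helpers written for this illustration, not part of perl or Encode, and the sketch ignores surrogates and code points above U+FFFF.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: compute the utf8 bytes for one code point (up to U+FFFF).
sub utf8_bytes {
    my ($cp) = @_;
    return ($cp) if $cp < 0x80;                  # 0xxxxxxx
    return (0xC0 | ($cp >> 6),                   # 110xxxxx: top 5 bits
            0x80 | ($cp & 0x3F)) if $cp < 0x800; # 10xxxxxx: low 6 bits
    return (0xE0 | ($cp >> 12),                  # 1110xxxx: top 4 bits
            0x80 | (($cp >> 6) & 0x3F),          # 10xxxxxx: middle 6 bits
            0x80 | ($cp & 0x3F)) if $cp < 0x10000;
    die "code point above U+FFFF is not handled in this sketch\n";
}

# Hypothetical helper: format a list of byte values as hex.
sub hex_bytes {
    return join ' ', map { sprintf '%02X', $_ } @_;
}

printf "U+%04X -> %s\n", $_, hex_bytes(utf8_bytes($_)) for (0xE9, 0xC3A9);
```

This prints C3 A9 for U+00E9 and EC 8E A9 for U+C3A9, which shows that "\x{c3a9}" names a different character entirely rather than the encoded form of "\xe9".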