Re: utf8::upgrade weirdness

In Perl 5.8.x, a string literal like "\xe9" has a slightly ambiguous nature -- it can end up either as a single byte in a non-unicode context, or as a two-byte utf8 character in a unicode context. I think this is intended as a "transitional" behavior, to make some things easier for folks who were habituated to iso-8859-1.

It so happens that the two-byte utf8 value for "\xe9" (a.k.a. é) turns out to be 0xC3 0xA9 -- but don't confuse that with "\x{c3a9}", which represents a completely different unicode code point (U+C3A9, one of the Hangul syllable characters).
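To make the distinction concrete, here is a minimal sketch (the variable names are mine, and it assumes the core Encode module is available) that encodes "\xe9" to utf8 bytes and then compares it with "\x{c3a9}":

```perl
use strict;
use warnings;
use Encode qw(encode);

my $e_acute = "\xe9";        # one character, code point U+00E9
my $other   = "\x{c3a9}";    # one character, code point U+C3A9

# Encode U+00E9 as utf8 and show the resulting bytes in hex.
my $bytes = encode('UTF-8', $e_acute);
printf "U+00E9 as utf8: %s\n",
    join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;   # C3 A9

# Both strings hold exactly one character, but not the same one.
printf "lengths: %d and %d\n", length($e_acute), length($other);
printf "code points: U+%04X and U+%04X\n", ord($e_acute), ord($other);
```

The bytes C3 and A9 only show up once the character has been encoded; "\x{c3a9}" never contains them as separate characters.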

If you read enough of perlunicode to understand how utf8 works (look for the section titled "Unicode Encodings"), you can figure out why the unicode code point U+00E9 (expressible in perl 5.8 as just "\xe9") turns out to be the two-byte sequence 0xC3 0xA9 when it's encoded as utf8 -- but hex-numeric literals in strings and regexes are supposed to express code points, not the bytes of their utf8 encoding. Note the following:

perl -e '$x="\xe9"; $y="\x{00e9}"; print "\\xe9 eq \\x00e9\n" if ($x eq $y)'

# output is:
\xe9 eq \x00e9
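
If you'd rather see the arithmetic than read the docs, this is a rough sketch of how a code point in the range U+0080..U+07FF gets split into the two-byte form 110xxxxx 10xxxxxx. The helper name two_byte_utf8 is made up for illustration; it is not part of Perl or any module:

```perl
use strict;
use warnings;

# Hypothetical helper: encode one code point in U+0080..U+07FF
# as two utf8 bytes of the form 110xxxxx 10xxxxxx.
sub two_byte_utf8 {
    my ($cp) = @_;
    die "code point outside the two-byte range\n"
        if $cp < 0x80 || $cp > 0x7FF;
    my $lead  = 0xC0 | ($cp >> 6);      # top 5 bits go into the lead byte
    my $trail = 0x80 | ($cp & 0x3F);    # low 6 bits go into the trail byte
    return ($lead, $trail);
}

printf "%02X %02X\n", two_byte_utf8(0xE9);   # prints: C3 A9
```

For 0xE9 (binary 11101001) the top bits are 00011 and the low six bits are 101001, which gives 0xC3 and 0xA9.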

update: To give a direct answer to your question:

Why is the latin e letter with acute not getting upgraded to UTF-8 ?

Actually, the letter is being upgraded to utf8; you were just comparing it to the wrong literal value.
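
Here is a small sketch of what I mean, using utf8::is_utf8 to peek at the internal flag (I'm assuming a plain script with no "use utf8" or similar pragmas in effect; the variable names are just for illustration):

```perl
use strict;
use warnings;

my $s = "\xe9";
my $before = utf8::is_utf8($s) ? 1 : 0;   # internal utf8 flag before upgrading
utf8::upgrade($s);
my $after  = utf8::is_utf8($s) ? 1 : 0;   # internal utf8 flag after upgrading

print "flag before: $before, after: $after\n";

# The upgrade changes the internal storage, not the characters in the string.
print "still one character\n"     if length($s) == 1;
print "still eq \\x{e9}\n"        if $s eq "\x{e9}";
print "not eq \\x{c3a9}\n"        if $s ne "\x{c3a9}";
```

So the string compares equal to "\x{e9}" both before and after the upgrade, and never to "\x{c3a9}".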

And in case you are trying to print the value '\xe9' to a file handle as utf8 data, you must first set the file handle to utf8 mode -- e.g.:

perl -e 'binmode STDOUT, ":utf8"; print "\xe9"' | xxd

# output is:
0000000: c3a9                                     ..
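
The same thing can be seen from inside a script. This is only a sketch, using in-memory file handles in place of real files so the bytes can be inspected directly, and ":encoding(UTF-8)" as the layer (the handle and variable names are my own):

```perl
use strict;
use warnings;

my ($raw, $encoded) = ('', '');

# One handle with no encoding layer, one with a utf8 encoding layer.
open my $raw_fh,  '>',                 \$raw     or die "open failed: $!";
open my $utf8_fh, '>:encoding(UTF-8)', \$encoded or die "open failed: $!";

print {$raw_fh}  "\xe9";
print {$utf8_fh} "\xe9";

close $raw_fh  or die "close failed: $!";
close $utf8_fh or die "close failed: $!";

# Show what each handle actually wrote, byte by byte.
for my $pair (['no layer', $raw], ['utf8 layer', $encoded]) {
    my ($label, $data) = @$pair;
    printf "%-10s: %s\n", $label,
        join ' ', map { sprintf '%02X', ord($_) } split //, $data;
}
```

Without a layer the handle gets the single byte E9; with the layer it gets C3 A9, matching the xxd output above.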