In Perl 5.8.x, a string literal like "\xe9" has a slightly ambiguous nature -- it can end up as either a single byte in a non-unicode context, or as a two-byte utf8 character in a unicode context. I think this is intended as a "transitional" behavior, to make some things easier for folks who were habituated to iso-8859-1.

It so happens that the two-byte utf8 encoding of "\xe9" (a.k.a. é) is 0xC3 0xA9 -- but don't confuse that with "\x{c3a9}", which represents a completely different unicode code point (U+C3A9, one of the Hangul syllables).
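
To make the distinction concrete, here is a minimal sketch using the core Encode module. The helper name utf8_hex is made up for this illustration, and it assumes an Encode recent enough to know the strict "UTF-8" encoding name:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the UTF-8 octets of a string as hex.
sub utf8_hex {
    my ($string) = @_;                    # copy, so the caller's value is untouched
    my $octets = encode('UTF-8', $string);
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $octets;
}

my $e_acute = "\x{e9}";      # U+00E9, a single code point
my $other   = "\x{c3a9}";    # U+C3A9, also a single code point, but a different one

printf "U+%04X encodes as %s\n", ord($e_acute), utf8_hex($e_acute);
printf "U+%04X encodes as %s\n", ord($other),   utf8_hex($other);
```

Both strings are one character long; only the first one encodes to the byte pair C3 A9, while U+C3A9 needs three bytes.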

If you read enough of perlunicode to understand how utf8 works (look for the section titled "Unicode Encodings"), you can figure out why the unicode code point U+00E9 (expressible in perl 5.8 as just "\xe9") turns out to be the two-byte sequence 0xC3 0xA9 when it's encoded as utf8 -- but hex-numeric literals in strings and regexes are supposed to express code points, not the bytes of their utf8 encoding. Note the following:

perl -e '$x="\xe9"; $y="\x{00e9}"; print "\\xe9 eq \\x00e9\n" if ($x eq $y)' # output is: \xe9 eq \x00e9
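
If you'd rather see the arithmetic than read the table, this is a rough sketch of the two-byte case only (code points U+0080 through U+07FF, bit layout 110xxxxx 10xxxxxx). The sub name utf8_two_byte is just for illustration, and real code should use Encode or a PerlIO layer instead:

```perl
use strict;
use warnings;

# Hand-encode one code point in the two-byte UTF-8 range.
# Illustration only; it does not handle 1-, 3- or 4-byte sequences.
sub utf8_two_byte {
    my ($code_point) = @_;
    die "code point outside U+0080..U+07FF\n"
        if $code_point < 0x80 || $code_point > 0x7FF;
    my $lead  = 0xC0 | ($code_point >> 6);      # 110xxxxx: the top five bits
    my $trail = 0x80 | ($code_point & 0x3F);    # 10xxxxxx: the low six bits
    return ($lead, $trail);
}

# 0xE9 is binary 11101001: top bits 00011, low bits 101001
printf "%02X %02X\n", utf8_two_byte(0xE9);      # prints: C3 A9
```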

update: To give a direct answer to your question:

Why is the latin e letter with acute not getting upgraded to UTF-8 ?
Actually, the letter is being upgraded to utf8; you were just comparing it to the wrong literal value.
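
Here is a small sketch of what the upgrade does and doesn't change. The describe sub is a made-up helper, and utf8::is_utf8 needs perl 5.8.1 or later:

```perl
use strict;
use warnings;

# Hypothetical helper: report the internal utf8 flag, the length in
# characters, and how many octets the UTF-8 encoding would take.
sub describe {
    my ($string) = @_;
    my $octets = $string;        # work on a copy
    utf8::encode($octets);       # characters -> UTF-8 octets, in place
    return sprintf 'flag=%d chars=%d utf8_octets=%d',
        (utf8::is_utf8($string) ? 1 : 0), length($string), length($octets);
}

my $plain    = "\xe9";
my $upgraded = "\xe9";
utf8::upgrade($upgraded);        # changes the internal storage only

print "plain:    ", describe($plain), "\n";
print "upgraded: ", describe($upgraded), "\n";
print "the two strings still compare equal\n" if $plain eq $upgraded;
```

Only the flag differs between the two; the string is still one character, U+00E9, so it still compares equal to "\xe9" and not to "\x{c3a9}".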

And in case you are trying to print the value '\xe9' to a file handle as utf8 data, you must first set the file handle to utf8 mode -- e.g.:

perl -e 'binmode STDOUT, ":utf8"; print "\xe9"' | xxd # output is: 0000000: c3a9 ..
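
The same effect can be checked without xxd by writing to an in-memory file handle. This is only a sketch: bytes_written is a hypothetical helper, and it uses the ":encoding(UTF-8)" layer rather than ":utf8":

```perl
use strict;
use warnings;

# Hypothetical helper: print a string through the given PerlIO layer
# into an in-memory handle and return the resulting bytes as hex.
sub bytes_written {
    my ($layer, $string) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer
        or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh or die "close failed: $!";
    return join ' ', map { sprintf '%02x', $_ } unpack 'C*', $buffer;
}

print "raw handle:  ", bytes_written(':raw', "\xe9"), "\n";
print "utf8 handle: ", bytes_written(':encoding(UTF-8)', "\xe9"), "\n";
```

Without an encoding layer the character goes out as the single byte e9; with the layer it goes out as c3 a9.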

In reply to Re: utf8::upgrade weirdness by graff
in thread utf8::upgrade weirdness by tbusch
