So, the "quasi-ambiguous" nature of bytes/characters in the \x80-\xff (U+0080-U+00FF) range seems deeper, subtler, and stranger than I expected: for this set, perl's internal representation is still single-byte, not really utf8.

It may be either single-byte or UTF-8, depending on your environment (pragmas). This is NO PROBLEM if you properly decode all your input and encode all your output. This is not a bug, but a feature that is much needed for backwards compatibility with old code.
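To make "decode all your input, encode all your output" concrete, here is a minimal sketch using I/O layers on in-memory filehandles. The variable names and the sample data are mine, chosen for illustration; real code would open actual files or sockets in the same way.

```perl
use strict;
use warnings;

# UTF-8 bytes as they would arrive from a file: 9 bytes, 7 characters.
my $input = "r\xC3\xA9sum\xC3\xA9\n";

# Decode on the way in: the layer turns UTF-8 bytes into characters.
open my $in, '<:encoding(UTF-8)', \$input or die "open in: $!";
my $line = <$in>;
close $in;

# Encode on the way out: the layer turns characters back into UTF-8 bytes.
my $output = '';
open my $out, '>:encoding(UTF-8)', \$output or die "open out: $!";
print {$out} $line;
close $out;

print length($line), " characters in, ", length($output), " bytes out\n";
```

Inside the program, $line is a string of characters, and you never have to care how perl stores it internally.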

But if it were set to ":utf8" before the first print statement, the two outputs would again be different, but in a different way, and the first one would be "wrong":

Before the "_utf8_on", which I stress is a BAD IDEA, the string is latin-1. It's converted to UTF-8 as the binmode requested: C3 becomes C3 83 and A9 becomes C2 A9, etcetera. With the "_utf8_on" you tell Perl that, no, it's not latin-1, but UTF-8. And since that matches the output encoding, Perl no longer has any need to convert anything.

In other words, first the string is "rÃ©sumÃ©\n", which when printed is encoded into UTF-8 as 72 [C3 83] [C2 A9] 73 75 6D [C3 83] [C2 A9] 0A; then someone messes with the internals and all of a sudden the string is "résumé\n", already UTF-8 encoded as 72 [C3 A9] 73 75 6D [C3 A9] 0A. (Two hex digits per byte; brackets group the bytes of one character.)
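The two byte sequences can be reproduced without touching the internals at all. This is only a sketch: hex_dump is a helper name I made up, and Encode::encode stands in here for what the ":utf8" output layer does when printing.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: hex dump of a byte string, two digits per byte.
sub hex_dump { join ' ', map { sprintf '%02X', ord } split //, $_[0] }

# UTF-8 bytes that were never decoded: perl sees 9 latin-1 characters.
my $bytes = "r\xC3\xA9sum\xC3\xA9\n";

# The right way to say "this is UTF-8": decode it into 7 characters.
my $text = decode('UTF-8', $bytes);

# Encoding the undecoded string turns every \x80-\xFF character into two bytes.
my $double = hex_dump(encode('UTF-8', $bytes));

# Encoding the decoded string gives back the original 9 bytes.
my $single = hex_dump(encode('UTF-8', $text));

print "undecoded: $double\n";
print "decoded:   $single\n";
```

Decoding gets you the same output bytes that "_utf8_on" happened to produce, but through a documented interface that also validates the input.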

Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }


In reply to Re^4: Interventionist Unicode Behaviors by Juerd
in thread Interventionist Unicode Behaviors by creamygoodness
