Hmm, this doesn't make sense to me: AFAIK Perl strings never store code points, but rather store the UTF-8 encoding of the code points e.g. the string with a Greek uppercase Kappa, whose code point is 039A: $str = "\x{039A}"; does not contain, in hex, 039A, but rather in hex, CE9A, the UTF8 encoding of that code point. /
You are getting diverted by how perl happens to store a string internally. This is almost always completely irrelevant, may change between perl versions, and is just confusing you. $str above is a perl string that contains one character, and ord() of that character is 0x39A. Whether perl happens to remember that fact by storing the two bytes 0xCE and 0x9A somewhere in memory shouldn't normally concern you.
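As a rough illustration of that point (a minimal sketch, not part of the original exchange): the variable names below are made up, and utf8::upgrade() is used only to force a different internal representation of the same string. The string's value, length and ord() stay the same either way.

```perl
use strict;
use warnings;

# One character, code point U+039A (GREEK CAPITAL LETTER KAPPA).
my $str = "\x{039A}";
printf "length=%d ord=0x%X\n", length($str), ord($str);   # length=1 ord=0x39A

# The same one-character string held in two different internal forms.
my $native   = "\xE9";       # U+00E9, normally stored internally as one byte
my $upgraded = $native;
utf8::upgrade($upgraded);    # change the internal storage only, not the value

# At the Perl level the two strings are indistinguishable.
printf "equal=%s lengths=%d/%d ords=0x%X/0x%X\n",
    ($native eq $upgraded ? "yes" : "no"),
    length($native), length($upgraded),
    ord($native), ord($upgraded);
# equal=yes lengths=1/1 ords=0xE9/0xE9
```

Only utf8::is_utf8(), which peeks at the internals, can tell $native and $upgraded apart.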
What your example seems to demonstrate, AFAICS, is the character v. byte o/p of length, when presented with strings where the UTF-8 flag is switched on/off. /
Again, forget the internals, don't worry about the internal UTF8 flag. The length function always returns the number of characters in a string, not the number of bytes.
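To make the character count concrete, here is a small sketch using four Greek letters (the variable names $chars and $bytes are just for illustration). The encoded result is a different perl string: eight characters, each with a value below 256.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Four characters: alpha, beta, gamma, delta.
my $chars = "\x{03B1}\x{03B2}\x{03B3}\x{03B4}";

# Eight characters, each holding one byte of the UTF-8 encoding.
my $bytes = encode_utf8($chars);

my $hex     = join " ", map { sprintf "%02X", ord } split //, $bytes;
my $verdict = ($chars eq $bytes) ? "same" : "different";

print length($chars), "\n";   # 4
print length($bytes), "\n";   # 8
print "$hex\n";               # CE B1 CE B2 CE B3 CE B4
print "$verdict\n";           # different
```

In both cases length() is counting characters; the two strings simply contain different characters.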
So for the final string, containing alpha, beta, gamma, and delta, it has a length of 4 characters, when Perl knows that it contains valid UTF-8, but a length of 8 when Perl is assuming the old byte=character semantics. However, both the strings are byte-for-byte identical. /
I don't understand what you are trying to say there.

Perhaps it would help if you viewed the encode_utf8() function as being equivalent to the one I include below. Does that make things any clearer? Note that my perl version of this function knows nothing about the internal representation of its arg, or whether it has its UTF8 flag set, etc.

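    sub encode_utf8 {
        my $e = '';
        # walk the string one character at a time, by code point
        for (map ord, split //, $_[0]) {
            if ($_ < 128) {
                # one byte: 0xxxxxxx
                $e .= chr($_);
            }
            elsif ($_ < 2048) {
                # two bytes: 110xxxxx 10xxxxxx
                $e .= chr(0xC0 + ($_ >> 6));
                $e .= chr(0x80 + ($_ & 63));
            }
            elsif (...) {
                ...
            }
        }
        return $e;
    }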
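For anyone who wants something runnable, here is one way the elided branches might be filled in. This is a sketch only: my_encode_utf8 is a hypothetical name, it is not how Encode is actually implemented, and it assumes every code point is at most 0x10FFFF, with no checks for surrogates or other oddities. The two-byte form covers code points below 0x800.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical pure-perl encoder; it looks only at code points, never at internals.
sub my_encode_utf8 {
    my ($str) = @_;
    my $e = '';
    for my $cp (map { ord } split //, $str) {
        if ($cp < 0x80) {            # 1 byte:  0xxxxxxx
            $e .= chr($cp);
        }
        elsif ($cp < 0x800) {        # 2 bytes: 110xxxxx 10xxxxxx
            $e .= chr(0xC0 | ($cp >> 6));
            $e .= chr(0x80 | ($cp & 0x3F));
        }
        elsif ($cp < 0x10000) {      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            $e .= chr(0xE0 | ($cp >> 12));
            $e .= chr(0x80 | (($cp >> 6) & 0x3F));
            $e .= chr(0x80 | ($cp & 0x3F));
        }
        else {                       # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            $e .= chr(0xF0 | ($cp >> 18));
            $e .= chr(0x80 | (($cp >> 12) & 0x3F));
            $e .= chr(0x80 | (($cp >> 6) & 0x3F));
            $e .= chr(0x80 | ($cp & 0x3F));
        }
    }
    return $e;
}

# Compare against the real Encode::encode_utf8 on a mixed sample.
my $sample = "A\x{E9}\x{039A}\x{20AC}\x{1F600}";
my $mine   = my_encode_utf8($sample);
my $theirs = Encode::encode_utf8($sample);
print $mine eq $theirs ? "match\n" : "mismatch\n";

# The Kappa from earlier in the thread.
my $kappa_hex = join " ", map { sprintf "%02X", ord } split //, my_encode_utf8("\x{039A}");
print "$kappa_hex\n";   # CE 9A
```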

Dave.


In reply to Re^5: What does Encode::encode_utf8 do to UTF-8 data ? by dave_the_m
in thread What does Encode::encode_utf8 do to UTF-8 data ? by scollyer
