I read perluniintro and tried the examples under "How Do I Convert Binary Data Into a Particular Encoding, Or Vice Versa?" with a multibyte character. I think the author is confusing "native string" with "bytes". The examples make sense for characters with code points below 128, but when I tried a letter like 'HIRAGANA LETTER A', they no longer made sense. In short, it seems to me that the examples forget the "encoding to bytes" step.

"A" is 0x41 as a byte and 0x41 as a code point.
"HIRAGANA LETTER A" is 0xe3,0x81,0x82 as bytes and 0x3042 as a code point.

#hex dump of A
#00000000 41 |A|
#00000001

#hex dump of HIRAGANA LETTER A
#00000000 e3 81 82 |...|
#00000003
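The difference between the code point and the encoded bytes can be sketched as below. This is only a minimal illustration, assuming UTF-8 as the target encoding; the helper names `code_points_hex` and `utf8_bytes_hex` are made up for this example and do not come from perluniintro.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: code points of a character string, as hex strings.
sub code_points_hex {
    my ($chars) = @_;
    return map { sprintf '%X', ord($_) } split //, $chars;
}

# Hypothetical helper: UTF-8 octets of a character string, as hex strings.
# encode() returns a byte string, so unpack 'C*' really sees octets here.
sub utf8_bytes_hex {
    my ($chars) = @_;
    return map { sprintf '%X', $_ } unpack 'C*', encode('UTF-8', $chars);
}

for my $char ("\x{41}", "\x{3042}") {
    printf "code point: %s  bytes: %s\n",
        join('|', code_points_hex($char)),
        join('|', utf8_bytes_hex($char));
}
```

For "A" both columns show 41; for HIRAGANA LETTER A the code point is 3042 while the bytes are E3|81|82, matching the hex dump above.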

Here are two example scripts.

#Example 1: native string may not be native string
#Code: $native_string=pack('W*', unpack('U*', $unicode_string));
use strict;
use warnings;
use Encode qw(encode);
use Devel::Peek;
use 5.012;

my($code_point,$unicode_string,$native_string, $native_string2);

$code_point=0x41;#"A";
$unicode_string=pack('U*', $code_point);
$native_string=pack('W*', unpack('U*', $unicode_string));
Dump $unicode_string;
Dump $native_string; # ==> here it is not UTF-8 flagged

$code_point=0x3042;#HIRAGANA LETTER A
$unicode_string=pack('U*', $code_point);
$native_string=pack('W*', unpack('U*', $unicode_string));
$native_string2=Encode::encode('utf8', $unicode_string);
Dump $unicode_string;
Dump $native_string; # ==> this is UTF-8 flagged, maybe transparently upgraded because code point > 255
Dump $native_string2;

For HIRAGANA LETTER A, Devel::Peek shows that $native_string is UTF-8 flagged and $native_string2 is not.
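The same observation can be checked without reading a Devel::Peek dump. The sketch below is only an illustration: the variable names `$via_pack_w` and `$via_encode` are made up here, and it uses `utf8::is_utf8` to inspect the internal flag, which is a debugging aid rather than something normal code should depend on.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $unicode_string = pack 'U*', 0x3042;    # HIRAGANA LETTER A

# The perluniintro-style conversion: still one character, 0x3042.
my $via_pack_w = pack 'W*', unpack 'U*', $unicode_string;

# An explicit encoding step: three UTF-8 octets.
my $via_encode = encode('UTF-8', $unicode_string);

for my $pair ([ 'pack W*' => $via_pack_w ], [ 'encode' => $via_encode ]) {
    my ($label, $string) = @$pair;
    printf "%-8s length=%d utf8_flag=%s\n",
        $label, length($string), utf8::is_utf8($string) ? 'on' : 'off';
}
```

The `pack 'W*'` result should report length 1 with the flag on, and the `encode` result length 3 with the flag off, which is consistent with the Devel::Peek output described above.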

#Example 2: it is not bytes, it is array of code point.
#Code: @bytes=unpack("C*", $unicode_string);
use strict;
use warnings;
use Encode qw(encode);
use 5.012;

my($code_point,$unicode_string,@bytes);

$code_point=0x41;#A
$unicode_string=pack('U*', $code_point);
@bytes=unpack("C*", $unicode_string);
print join('|', @bytes), "\n";

$code_point=0x3042;#HIRAGANA LETTER A
$unicode_string=pack('U*', $code_point);
@bytes=unpack("C*", $unicode_string);
print join('|', @bytes), "\n"; #==>these are not bytes, but array of codepoints

$code_point=0x3042;#HIRAGANA LETTER A
$unicode_string=pack('U*', $code_point);
@bytes=map{ sprintf("%X",$_) } unpack("C*", Encode::encode('utf8', $unicode_string));
print join('|', @bytes), "\n";
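For the "vice versa" direction, a round trip through Encode might look like the sketch below. It assumes UTF-8 on both sides, and the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $unicode_string = "\x{3042}";    # HIRAGANA LETTER A

# Characters -> octets: encode first, then unpack the octets.
my @bytes = unpack 'C*', encode('UTF-8', $unicode_string);

# Octets -> characters: rebuild the byte string, then decode it.
my $octets     = pack 'C*', @bytes;
my $round_trip = decode('UTF-8', $octets);

printf "bytes=%s round trip ok=%s\n",
    join('|', map { sprintf '%X', $_ } @bytes),
    $round_trip eq $unicode_string ? 'yes' : 'no';
```

Keeping `encode` and `decode` at the boundary makes it explicit which variables hold characters and which hold octets.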

So I would like to hear suggestions, comments, or pointers to documentation ("read this document") from the monks. I am now reading perlunicode.

regards.


In reply to Example of perluniintro by remiah
