I saw the output...
C: An unsigned char (octet, 8-bit) value.
W: An unsigned char value (can be greater than 255).
So why can "C" values become greater than 255?
#unpack "C*", $unicode_string.$unicode_string;
#("UNSIGNED OCTETS(C*) ", 12354, 12354)
This seems strange...
Do you mean my example should use "W" for unpack? If so, does this make sense? The result is the same on my machine. My point is that @bytes does not hold bytes; it holds the decimal code point of "HIRAGANA LETTER A".
$code_point = 0x3042;  # HIRAGANA LETTER A
$unicode_string = pack('U*', $code_point);
@bytes = unpack("W*", $unicode_string);
print join('|', @bytes), "\n";  # ==> these are not bytes, but an array of code points
use Encode;
$code_point = 0x3042;  # HIRAGANA LETTER A
$unicode_string = pack('U*', $code_point);
@bytes = map { sprintf("%X", $_) } unpack("W*", Encode::encode('utf8', $unicode_string));
print join('|', @bytes), "\n";  # ==> E3|81|82, the UTF-8 bytes of U+3042
I really should read packtut.
I am waiting for your reply.
Update:
I found this description in perlunicode:
"pack("C") and unpack("C") are methods for emulating byte-oriented chr() and ord() on Unicode strings. While these methods reveal the internal encoding of Unicode strings, that is not something one normally needs to care about at all."
So I think:
# this is wrong
@bytes=unpack("C*", $unicode_string);
# this is right
@bytes = unpack("C*", Encode::encode('utf8', $unicode_string));
Is that right?
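To make the distinction concrete, here is a minimal sketch that prints the code points and the UTF-8 bytes of the same string side by side. It assumes Perl 5.10 or later (where the "W" template is available) and the core Encode module; the names @code_points and @utf8_bytes are mine, chosen only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $code_point     = 0x3042;                    # HIRAGANA LETTER A
my $unicode_string = pack('U*', $code_point);   # a one-character string

# One value per character: the code point itself.
my @code_points = unpack('W*', $unicode_string);

# One value per octet: encode to UTF-8 first, then unpack the bytes.
my @utf8_bytes = unpack('C*', encode('UTF-8', $unicode_string));

print 'code points: ', join('|', map { sprintf '%X',   $_ } @code_points), "\n";
print 'utf-8 bytes: ', join('|', map { sprintf '%02X', $_ } @utf8_bytes),  "\n";
```

If I have this right, the first line shows a single value (3042) and the second shows three octets (E3|81|82), so only the encoded string really gives "bytes".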