in reply to Example of perluniintro

So, I want to hear anything from the monks: suggestions, comments, or "read this document". I am now reading perlunicode.

What is it you're really after?

I like reading perlunitut: Unicode in Perl and http://perldoc.perl.org/perlpacktut.html#Unicode

Re^2: Example of perluniintro
by remiah (Hermit) on Aug 18, 2012 at 04:14 UTC
    Thank you for the reply.

    I am looking for confirmation: either the author of perluniintro forgot to encode characters to bytes, or I am missing something. What do you think?

      Either the author of perluniintro forgot to encode characters to bytes, or I am missing something. What do you think?

      I don't think the author forgot anything, but I'm not sure what you think is missing.

      Consider these three lines of output. Do you see something wrong with them?

      #!/usr/bin/perl --
      use strict;
      use warnings;
      use Data::Dump;

      my $code_point = 0x3042; # HIRAGANA LETTER A aka 12354
      my $unicode_string = pack('U*', $code_point);
      dd 12354 => pack('U*', 12354);
      dd "UNSIGNED CHARS(W*) ", pack "W*", unpack "U*", $unicode_string.$unicode_string;
      dd "UNSIGNED OCTETS(C*) ", unpack "C*", $unicode_string.$unicode_string;
      __END__
      (12354, "\x{3042}")
      ("UNSIGNED CHARS(W*) ", "\x{3042}\x{3042}")
      ("UNSIGNED OCTETS(C*) ", 12354, 12354)
        I saw the output...

        C: an unsigned char (octet, 8-bit) value.
        W: an unsigned char value (can be greater than 255).

        So why can "C" values be greater than 255?

        #unpack "C*", $unicode_string.$unicode_string;
        #("UNSIGNED OCTETS(C*) ", 12354, 12354)

        this seems strange...
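The distinction at issue here, characters versus the octets that encode them, can be made visible without relying on how "C" treats wide characters. The following is only a minimal sketch, assuming the core Encode module is available; the variable names are illustrative and not taken from perluniintro.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Two copies of HIRAGANA LETTER A (U+3042) as a character string.
my $chars = "\x{3042}" x 2;

# The same text encoded to UTF-8: now a string of octets.
my $octets = encode('UTF-8', $chars);

# ord() on each character yields a code point, which may exceed 255.
my @code_points = map { ord } split //, $chars;

# unpack "C*" on an octet string yields one value in 0..255 per octet.
my @octet_values = unpack 'C*', $octets;

print "characters: ", length($chars),  " -> @code_points\n";
print "octets:     ", length($octets), " -> @octet_values\n";
```

The character string has length 2 and the encoded string has length 6, because U+3042 takes three octets (227, 129, 130) in UTF-8.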

        Do you mean my example should use "W" for unpack? If so, does this make sense? The result is the same on my machine. My point is that @bytes does not hold bytes; it holds the decimal code points for "HIRAGANA LETTER A".

        use Encode;

        $code_point = 0x3042; # HIRAGANA LETTER A
        $unicode_string = pack('U*', $code_point);
        @bytes = unpack("W*", $unicode_string);
        print join('|', @bytes), "\n"; #==> these are not bytes, but an array of code points

        $code_point = 0x3042; # HIRAGANA LETTER A
        $unicode_string = pack('U*', $code_point);
        @bytes = map { sprintf("%X", $_) } unpack("W*", Encode::encode('utf8', $unicode_string));
        print join('|', @bytes), "\n";

        I really should read perlpacktut.
        I am waiting for your reply.

        update:
        I found this description in perlunicode:

        "pack("C") and unpack("C") are methods for emulating byte-oriented chr() and ord() on Unicode strings. While these methods reveal the internal encoding of Unicode strings, that is not something one normally needs to care about at all."
        
        So I think

        # this is wrong
        @bytes = unpack("C*", $unicode_string);
        # this is right
        @bytes = unpack("C*", Encode::encode('utf8', $unicode_string));

        Isn't that so?
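The "encode first, then unpack" approach can be wrapped in a small helper. This is a hedged sketch rather than anything from perluniintro: utf8_hex_bytes is a hypothetical name, and it assumes that UTF-8 is the encoding whose bytes are wanted.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: return the UTF-8 octets of a character string
# as two-digit uppercase hex strings.
sub utf8_hex_bytes {
    my ($string) = @_;
    my $octets = encode('UTF-8', $string);   # characters -> octets
    return map { sprintf '%02X', $_ } unpack 'C*', $octets;
}

my $unicode_string = pack 'U*', 0x3042;      # HIRAGANA LETTER A
my @hex = utf8_hex_bytes($unicode_string);
print join('|', @hex), "\n";                 # prints E3|81|82

# Decoding the octets gives back the original character string.
my $round_trip = decode('UTF-8', encode('UTF-8', $unicode_string));
print $round_trip eq $unicode_string ? "round trip ok\n" : "mismatch\n";
```

Because the encoding is named explicitly, the result does not depend on how Perl happens to store the string internally.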