in reply to Re: Example of perluniintro
in thread Example of perluniintro

Thank you for the reply.

I am looking for confirmation. Did the author of perluniintro forget to encode characters to bytes, or am I missing something? What do you think?

Re^3: Example of perluniintro
by Anonymous Monk on Aug 18, 2012 at 04:30 UTC

    Did the author of perluniintro forget to encode characters to bytes, or am I missing something? What do you think?

    I don't think the author forgot anything, but I'm not sure what you think the author forgot.

    Consider these three lines of output. Do you see anything wrong with them?

    #!/usr/bin/perl --
    use strict;
    use warnings;
    use Data::Dump;

    my $code_point = 0x3042;    # HIRAGANA LETTER A aka 12354
    my $unicode_string = pack('U*', $code_point);
    dd 12354 => pack('U*', 12354);
    dd "UNSIGNED CHARS(W*) ", pack "W*", unpack "U*", $unicode_string.$unicode_string;
    dd "UNSIGNED OCTETS(C*) ", unpack "C*", $unicode_string.$unicode_string;
    __END__
    (12354, "\x{3042}")
    ("UNSIGNED CHARS(W*) ", "\x{3042}\x{3042}")
    ("UNSIGNED OCTETS(C*) ", 12354, 12354)
      I saw the output...

      C: an unsigned char (octet, 8-bit) value.
      W: an unsigned char value (can be greater than 255).

      So why can "C" values be greater than 255?

      #unpack "C*", $unicode_string.$unicode_string;
      #("UNSIGNED OCTETS(C*) ", 12354, 12354)

      this seems strange...

      Do you mean my example should use "W" for unpack? If so, does this make sense? The result is the same on my machine. My point is that @bytes does not hold bytes; it holds the decimal code points for "HIRAGANA LETTER A".

      $code_point=0x3042;#HIRAGANA LETTER A
      $unicode_string=pack('U*', $code_point);
      @bytes=unpack("W*", $unicode_string);
      print join('|', @bytes), "\n";
      #==> these are not bytes, but an array of code points

      use Encode;    # needed for Encode::encode below
      $code_point=0x3042;#HIRAGANA LETTER A
      $unicode_string=pack('U*', $code_point);
      @bytes=map{ sprintf("%X",$_) } unpack("W*", Encode::encode('utf8', $unicode_string));
      print join('|', @bytes), "\n";
      I really should read packtut.
      I am waiting for your reply.

      update:
      I came across this passage in perlunicode:

      " pack("C") and unpack("C") are methods for emulating byte-oriented chr() and ord() on Unicode strings. While these methods reveal the internal encoding of Unicode strings, that is not something one normally needs to care about at all."
      
      So I think:

      # this is wrong
      @bytes=unpack("C*", $unicode_string);
      # this is right
      @bytes=unpack("C*", Encode::encode('utf8',$unicode_string));

      Isn't that so?
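As a minimal sketch of the "encode first, then unpack" approach described above (assuming the core Encode module is available; the variable names are only illustrative), the character string is turned into UTF-8 octets before "C*" ever sees it:

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character: HIRAGANA LETTER A (U+3042).
my $unicode_string = chr(0x3042);

# Encode the character string into a byte string holding its UTF-8 form.
my $octets = encode('UTF-8', $unicode_string);

# The input is now a byte string, so every unpacked value is an octet (0..255).
my @bytes = unpack('C*', $octets);

print join('|', map { sprintf '%02X', $_ } @bytes), "\n";    # E3|81|82
```

Because the string handed to unpack already consists of bytes, the result does not depend on which mode unpack happens to be in.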

        So why can "C" values be greater than 255? This seems strange...

        It's all strange to me, I'm not joking.

        From http://perldoc.perl.org/5.14.1/functions/pack.html

        Pack and unpack can operate in two modes: character mode (C0 mode) where the packed string is processed per character, and UTF-8 mode (U0 mode) where the packed string is processed in its UTF-8-encoded Unicode form on a byte-by-byte basis. Character mode is the default unless the format string starts with U. You can always switch mode mid-format with an explicit C0 or U0 in the format. This mode remains in effect until the next mode change, or until the end of the () group it (directly) applies to.

        Using C0 to get Unicode characters while using U0 to get non-Unicode bytes is not necessarily obvious. Probably only the first of these is what you want:

        ...

        Those examples also illustrate that you should not try to use pack/unpack as a substitute for the Encode module.

        So trying that I get

        dd "UNSIGNED OCTETS(C*) ", unpack "C0C*", $unicode_string.$unicode_string;
        dd "UNSIGNED OCTETS(C*) ", unpack "U0C*", $unicode_string.$unicode_string;
        __END__
        ("UNSIGNED OCTETS(C*) ", 12354, 12354)
        ("UNSIGNED OCTETS(C*) ", 227, 129, 130, 227, 129, 130)

        So, yes, I think I agree that it's a mistake, in that it should probably say: You can find the bytes that make up a UTF-8 sequence with:

        @bytes = unpack("U0C*", $Unicode_string);

        And this seems to confirm that

        $code_point=0x3042;#HIRAGANA LETTER A
        $unicode_string=pack('U*', $code_point);
        @bytes=map{ sprintf("%X",$_) } unpack("U0C*", $unicode_string);
        print join('|', @bytes), "\n";
        __END__
        E3|81|82
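One way to cross-check the result above is to compare it with what the Encode module produces for the same string. This is only a sketch under the assumption that Encode is installed; @via_unpack, @via_encode and $same are names made up for the comparison:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $unicode_string = pack('U*', 0x3042);    # HIRAGANA LETTER A

# UTF-8 mode: unpack walks the UTF-8 form of the string byte by byte.
my @via_unpack = unpack('U0C*', $unicode_string);

# Explicit route: encode to UTF-8 octets first, then unpack plain bytes.
my @via_encode = unpack('C*', encode('UTF-8', $unicode_string));

my $same = "@via_unpack" eq "@via_encode" ? 'same' : 'different';

print "U0C*:   @via_unpack\n";
print "Encode: @via_encode\n";
print "Result: $same\n";
```

For this one-character string both routes should give the same three octets, which matches the 227, 129, 130 output shown earlier in the thread.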

        update: It says in another part of perluniintro

        One way of peeking inside the internal encoding of Unicode characters is to use unpack("C*", ... to get the bytes of whatever the string encoding happens to be, or unpack("U0..", ...) to get the bytes of the UTF-8 encoding:

        So yeah, whatever Perl's actual internal format is (the one we shouldn't care about), it is not UTF-8, and if you want the UTF-8 bytes you need U0C*; otherwise (it looks like) you get IV bytes.
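The character-versus-byte distinction this thread keeps circling can also be checked without pack or unpack. The following is a small sketch, assuming the Encode module, with illustrative variable names; it looks only at the character string and at its explicitly encoded form, and makes no claim about Perl's internal representation:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char_string = chr(0x3042);                     # one character
my $byte_string = encode('UTF-8', $char_string);   # its UTF-8 octets

my $code_point = ord($char_string);       # numeric code point of the character
my $char_count = length($char_string);    # length in characters
my $byte_count = length($byte_string);    # length of the encoded byte string

printf "code point U+%04X, %d character, %d bytes\n",
    $code_point, $char_count, $byte_count;
# prints: code point U+3042, 1 character, 3 bytes
```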