So why could "C" values become greater than 255? This seems strange...

It's all strange to me, I'm not joking.

From http://perldoc.perl.org/5.14.1/functions/pack.html

Pack and unpack can operate in two modes: character mode (C0 mode) where the packed string is processed per character, and UTF-8 mode (U0 mode) where the packed string is processed in its UTF-8-encoded Unicode form on a byte-by-byte basis. Character mode is the default unless the format string starts with U. You can always switch mode mid-format with an explicit C0 or U0 in the format. This mode remains in effect until the next mode change, or until the end of the () group it (directly) applies to.

Using C0 to get Unicode characters while using U0 to get non-Unicode bytes is not necessarily obvious. Probably only the first of these is what you want:
...
Those examples also illustrate that you should not try to use pack/unpack as a substitute for the Encode module.

So trying that I get

dd "UNSIGNED OCTETS(C*) ", unpack "C0C*", $unicode_string.$unicode_string;
dd "UNSIGNED OCTETS(C*) ", unpack "U0C*", $unicode_string.$unicode_string;
__END__
("UNSIGNED OCTETS(C*) ", 12354, 12354)
("UNSIGNED OCTETS(C*) ", 227, 129, 130, 227, 129, 130)

So yes, I think I agree: it's a mistake, in that it should probably say "You can find the bytes that make up a UTF-8 sequence with:"

@bytes = unpack("U0C*", $Unicode_string);

And this seems to confirm that


    $code_point = 0x3042;    # HIRAGANA LETTER A
    $unicode_string = pack('U*', $code_point);
    @bytes = map { sprintf("%X", $_) } unpack("U0C*", $unicode_string);
    print join('|', @bytes), "\n";
__END__
E3|81|82

Update: another part of perluniintro says:

One way of peeking inside the internal encoding of Unicode characters is to use unpack("C*", ... to get the bytes of whatever the string encoding happens to be, or unpack("U0..", ...) to get the bytes of the UTF-8 encoding:

So yeah, whatever Perl's actual internal format is (and we shouldn't care about it), plain C* does not give you the UTF-8 bytes here. If you want the UTF-8 bytes, you need U0C*; otherwise (it looks like) you get one integer value per character, such as 12354 above.
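
For what it's worth, the pack documentation quoted above says not to use pack/unpack as a substitute for the Encode module, so here is a minimal sketch of that route next to the U0C* one. It assumes a perl that behaves as in the examples above; the variable names are my own.

```perl
use strict;
use warnings;
use Encode qw(encode);

# HIRAGANA LETTER A, the same code point as in the example above
my $unicode_string = chr(0x3042);

# Encode the character string into a UTF-8 byte string
my $utf8_bytes = encode('UTF-8', $unicode_string);

# Every element of the byte string is one octet, so plain C* is enough here
my @from_encode = map { sprintf '%X', $_ } unpack 'C*', $utf8_bytes;

# The U0C* approach from above, for comparison
my @from_unpack = map { sprintf '%X', $_ } unpack 'U0C*', $unicode_string;

print join('|', @from_encode), "\n";
print join('|', @from_unpack), "\n";
```

Both lines should come out as E3|81|82 if unpack behaves as shown earlier, and the Encode version does not depend on which mode unpack happens to be in.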

In reply to Re^5: Example of perluniintro by Anonymous Monk
in thread Example of perluniintro by remiah
