
pack and unpack have always been all about how the bytes are laid out in memory. Pretending now that they aren't just leads to insane results.

What useful value does pack "U0V", $int have? What does it even represent? Are there processors that implement integers as "four codepoints, encoded in UTF-8"? Would it ever be useful to send that over a socket?

How many hex digits does unpack "H*", pack "U", $codepoint return? UTF-8 defines that quite clearly given the long-standing definition of unpack as dealing with bytes. But in your world view, it should return some variable number of hex digits that has no clear definition already laid out for it. Could it return 3 if that is "enough"? Perhaps it should only return 1 for control characters? Or should it return an even number until all of the bits are taken care of? It is your fantasy world and I have little clue what would "make sense" in such a strange place. What Perl 5.10 does certainly doesn't seem to make much sense (and disagrees with the documentation and is fairly useless).
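For reference, here is a minimal sketch of the byte-oriented answer: the hex digits of the UTF-8 octets for a code point, with the encoding step made explicit so the result should not depend on how a given Perl treats "U" strings. The helper name utf8_octets_hex is made up for illustration; utf8::encode is the core function.

```perl
use strict;
use warnings;

# Hypothetical helper (the name is not from this thread): hex dump of the
# UTF-8 octets that encode a single code point.
sub utf8_octets_hex {
    my ($codepoint) = @_;
    my $octets = pack "U", $codepoint;   # a string holding one character
    utf8::encode($octets);               # replace it with its UTF-8 octets
    return unpack "H*", $octets;         # two hex digits per octet
}

# Prints:
#   U+0041 => 41
#   U+00E9 => c3a9
#   U+263A => e298ba
printf "U+%04X => %s\n", $_, utf8_octets_hex($_) for 0x41, 0xE9, 0x263A;
```

Always an even number of hex digits, two per octet, and the count is fixed by UTF-8 itself.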

Now, how many characters does unpack "B*", pack "U", $codepoint return? What does pack "B16", $bits produce when some Unicode bit sneaks into the equation without me noticing? Does it change from generating two bytes to generating two characters, each encoded in UTF-8? Does it generate a 16-bit codepoint that is then encoded into UTF-8? Whatever gets decided, good luck explaining the answer as part of the already-way-too-confusing documentation for pack.

Notice that "U" was actually defined exactly in accordance with my view. It produces the same bytes, even when you add in the crazy "unicode vs. bytes mode" stuff of 5.10. Because pack() has always been about packing bytes into interesting shapes.

Yes, concatenating the output of pack with a UTF-8 string in Perl breaks things. Pretending it doesn't just hides the fact that your data is no longer packed the way that you specified it should be. And then you follow the documentation (and over a decade of precedent) and use unpack "C*" to verify that your octets are exactly as they are supposed to be, and Perl 5.10 lies to you.
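A small sketch of that breakage, assuming a two-octet "n" field and a single wide character as stand-ins (neither value comes from the thread). The helper as_utf8_hex is hypothetical; it shows the octets you get when the joined string is written out as UTF-8.

```perl
use strict;
use warnings;

# Hypothetical helper: hex dump of the octets a string becomes when it is
# encoded as UTF-8. Works on a copy, so the caller's string is untouched.
sub as_utf8_hex {
    my ($string) = @_;
    utf8::encode($string);
    return unpack "H*", $string;
}

my $packed = pack "n", 0xE9FF;    # two octets: e9 ff
my $text   = "\x{263A}";          # one character above 0xFF
my $joined = $packed . $text;     # the packed octets are now characters

print unpack("H*", $packed), "\n";    # e9ff
print as_utf8_hex($joined), "\n";     # c3a9c3bfe298ba
```

The two octets that were asked for have become four by the time the string is encoded as UTF-8, which is the sense in which the data is no longer packed as specified.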

Perl 5.10 "fixed" something by making the breakage harder to notice. That is no improvement. And it leads to a model for what pack/unpack do that is so confused that it will be tons harder for people to wrap their heads around (and wrapping your head around pack/unpack was already plenty hard). Clearly, the authors of this new paradigm haven't even wrapped their own heads around what they dreamed up yet, given the mishmash of half-done changes in the behavior of lots of un/pack templates in 5.10.

- tye

In reply to Re^9: Parsing UTF-16LE CSV Records Using Text::CSV* (5.10) by tye
in thread Parsing UTF-16LE CSV Records Using Text::CSV* by Jim
