in reply to Re^6: Parsing UTF-16LE CSV Records Using Text::CSV* (5.10)
in thread Parsing UTF-16LE CSV Records Using Text::CSV*

"C" doesn't unpack 8-bit code points. If you want codepoints, use "U" or ord. It unpacks pre-Unicode characters and the "eight-bit" and "octet" comments are meant to make that clear. It is an accident of history that 'b' was used by "binary" (base 2) not "byte" and so "c" got used for "byte" (via the mnemonic "char").

That doesn't mean "c" should be changed to mean "character, perhaps multi-byte", especially given the nature of pack and unpack. pack and unpack have always been about how the bytes are encoded into memory. Trying to change them into pretending that they don't care about that is a huge mistake.

Update:

It doesn't hurt anything.

What part of "silent change" don't you understand?

- tye        

Re^8: Parsing UTF-16LE CSV Records Using Text::CSV* (5.10)
by ikegami (Patriarch) on Jul 20, 2009 at 19:34 UTC

    As best as I can tell, a code point is an index into a character set. I don't see how that relates [unless you're saying something is a code point if it's internally encoded using one encoding (UTF-8), but somehow it's not if it's internally encoded using another (iso-latin-1)]. All I have is a packed byte with no association to any character set.

    unpack 'C', substr("\xA0$s", 0, 1) doesn't give 0xA0 in 5.8.8, and that's a bug.
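A minimal sketch of that case, for anyone who wants to try it. It assumes Perl 5.10 or later, and it picks chr(0x1234) for $s only so that the concatenated string is stored internally as UTF-8; the variable names are illustrative, not from the thread.

```perl
use strict;
use warnings;

# Assumption: $s holds a character above 255, which forces the
# concatenated string to be stored internally as UTF-8.
my $s     = chr(0x1234);
my $first = substr("\xA0$s", 0, 1);   # one character: U+00A0

my $via_ord = ord($first);            # value of the character
my $via_C   = unpack('C', $first);    # 5.10+ unpacks per character

printf "ord: 0x%02X  unpack C: 0x%02X\n", $via_ord, $via_C;
```

On 5.10 and later both values come out as 0xA0. As reported above, 5.8.8 does not return 0xA0 from the unpack call.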

    have always been about how the bytes are encoded into memory

    Of course. You're saying it also matters how those encoded bytes are encoded, and I disagree with that.

    unpack "H*", pack "U", 0x1234 results under 5.10 are mostly nonsensical, not "fixed".

    I forgot to address this earlier.

    I can see a case for unpack 'H*', $characters_higher_than_255 returning something more sensible. Same idea as allowing characters above 255 for 'C': It doesn't hurt anything. It even aids backwards compatibility.

      pack and unpack have always been all about how the bytes are laid out in memory. Trying to pretend now that they don't just leads to insane results.

      What useful value does pack "U0V", $int have? What does it even represent? Are there processors that implement integers as "four codepoints, encoded in UTF-8"? Would it ever be useful to send that over a socket?
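For contrast, a small sketch of what plain "V" lays out when no Unicode is involved. The value 0x12345678 is an arbitrary example, and the variable names are illustrative.

```perl
use strict;
use warnings;

my $int    = 0x12345678;          # arbitrary example value
my $packed = pack('V', $int);     # 32-bit unsigned, little-endian

my $len = length($packed);        # number of bytes produced
my $hex = unpack('H*', $packed);  # those bytes as hex, in memory order

print "$len bytes: $hex\n";       # 4 bytes: 78563412
```

That is the layout a little-endian machine or a wire protocol would expect: four octets, least significant first.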

      How many hex digits does unpack "H*", pack "U", $codepoint return? UTF-8 defines that quite clearly given the long-standing definition of unpack as dealing with bytes. But in your world view, it should return some variable number of hex digits that has no clear definition already laid out for it. Could it return 3 if that is "enough"? Perhaps it should only return 1 for control characters? Or should it return an even number until all of the bits are taken care of? It is your fantasy world and I have little clue what would "make sense" in such a strange place. What Perl 5.10 does certainly doesn't seem to make much sense (and disagrees with the documentation and is fairly useless).

      Now, how many characters does unpack "B*", pack "U", $codepoint return? What does pack "B16", $bits produce when some Unicode bit sneaks into the equation without me noticing? Does it change from generating two bytes to generating two characters, each encoded in UTF-8? Does it generate a 16-bit codepoint that is then encoded into UTF-8? Whatever gets decided, good luck explaining the answer as part of the already-way-too-confusing documentation for pack.
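The byte-oriented baseline for "B" can be sketched as follows, assuming a plain bit string with no characters above 255 anywhere in sight. The bit pattern is an arbitrary example.

```perl
use strict;
use warnings;

my $bits   = '0001001000110100';     # 16 bits, high bit first
my $packed = pack('B16', $bits);     # packs into two bytes

my $len   = length($packed);         # 2
my $hex   = unpack('H*', $packed);   # "1234"
my $round = unpack('B*', $packed);   # the original bit string

print "$len bytes, hex $hex, bits $round\n";
```

Sixteen bits in, two bytes out, and the round trip through "B*" returns the same sixteen bits.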

      Notice that "U" was actually defined exactly in accordance with my view. It produces the same bytes, even when you add in the crazy "unicode vs. bytes mode" stuff of 5.10. Because pack() has always been about packing bytes into interesting shapes.
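One way to see the bytes in question without depending on how a given Perl version treats "H*" on a character string is to encode explicitly first. This is only a sketch: the use of the core Encode module and the code point 0x1234 are choices made for illustration, not something taken from the thread.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $codepoint = 0x1234;                   # example code point
my $char      = pack('U', $codepoint);    # one character, U+1234
my $octets    = encode('UTF-8', $char);   # explicit UTF-8 byte string

my $nchars  = length($char);              # 1
my $noctets = length($octets);            # 3
my $hex     = unpack('H*', $octets);      # "e188b4"

print "$nchars char, $noctets octets: $hex\n";
```

Once the string is explicitly a byte string, unpack "H*" has an unambiguous answer: six hex digits for the three UTF-8 octets.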

      Yes, concatenating the output of pack with a UTF-8 string in Perl breaks things. Pretending it doesn't just belies the fact that your data is no longer packed the way that you specified it should be. And then you follow the documentation (and over a decade of precedent) and use unpack "C*" to verify that your octets are exactly as they are supposed to be and Perl 5.10 lies to you.
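A rough sketch of that concatenation, assuming Perl 5.10 or later. bytes::length is used here purely as a diagnostic to peek at the internal storage; that choice, and the variable names, are illustrative.

```perl
use strict;
use warnings;
use bytes ();   # loads bytes::length without enabling byte semantics

my $packed = pack('C', 0xA0);            # a single octet
my $joined = $packed . chr(0x1234);      # stored internally as UTF-8

my $chars    = length($joined);          # 2 characters
my $internal = bytes::length($joined);   # 5 bytes: C2 A0 E1 88 B4

print "$chars characters, $internal bytes internally\n";
```

The octet that was packed as 0xA0 now occupies two bytes in memory, even though the string still reports it as one character.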

      Perl 5.10 "fixed" something by making the breakage harder to notice. That is no improvement. And it leads to a model for what pack/unpack do that is so confused that it will be tons harder for people to wrap their heads around (and wrapping your head around pack/unpack was already plenty hard). Clearly, the authors of this new paradigm haven't even wrapped their own heads around what they dreamed up yet, given the mishmash of half-done changes in the behavior of lots of un/pack templates in 5.10.

      - tye        

        But in your world view, [unpack "H*", pack "U", $codepoint] should return some variable number of hex digits that has no clear definition already laid out for it.

        pack producing something that isn't bytes? Unpacking something that (potentially) isn't bytes? In my world view, the construct doesn't make much sense.

        I don't have any experience with "U", which is why I didn't comment on it initially. I'm not in a good position to judge what the problems with the new or old methods are, since I don't even know what kind of problems it solves. I'll read your post when I have more time to absorb it.