Re^7: Parsing UTF-16LE CSV Records Using Text::CSV* (5.10)

Re^8: Parsing UTF-16LE CSV Records Using Text::CSV* (5.10) by ikegami (Patriarch) on Jul 20, 2009 at 19:34 UTC
As best as I can tell, a code point is an index into a character set. I don't see how that relates [unless you're saying something is a code point if it's internally encoded using one encoding (UTF-8), but somehow it's not if it's internally encoded using another (iso-latin-1)]. All I have is a packed byte with no association to any character set. `unpack 'C', substr("\xA0$s", 0, 1)` doesn't give `0xA0` in 5.8.8, and that's a bug.

> have always been about how the bytes are encoded into memory

Of course. You're saying it also matters how those encoded bytes are encoded, and I disagree with that.

> `unpack "H", pack "U", 0x1234` results under 5.10 are mostly nonsensical, not "fixed".

I forgot to address this earlier. I can see a case for `unpack 'H*', $characters_higher_than_255` returning something more sensible. Same idea as allowing characters above 255 for 'C': it doesn't hurt anything. It even aids backwards compatibility.
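For readers following along, here is a minimal sketch of the expression under discussion. It assumes Perl 5.10 or later and assumes `$s` holds a character above 255 (the post does not say what `$s` contains). It compares the value `unpack 'C'` reports with the bytes Perl uses internally, which `utf8::encode` exposes on a copy.

```perl
use strict;
use warnings;

# Assumption: $s contains a character above 255, so Perl stores it as UTF-8.
my $s = "\x{263A}";

# Concatenation upgrades the whole string, including the leading \xA0.
my $first = substr("\xA0$s", 0, 1);

# On 5.10 and later, 'C' sees the character value, not the internal encoding.
my $as_char = unpack 'C', $first;

# Encode a copy to UTF-8 to inspect the bytes used internally for that character.
my $bytes = $first;
utf8::encode($bytes);
my @internal = unpack 'C*', $bytes;

printf "unpack 'C' value: 0x%02X\n", $as_char;
printf "internal bytes:   %s\n", join ' ', map { sprintf '0x%02X', $_ } @internal;
```

On such a Perl the first line should report `0xA0`, while the internal representation is the two-byte sequence `0xC2 0xA0`, which is the difference the two posters are arguing about.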
Re^9: Parsing UTF-16LE CSV Records Using Text::CSV* (5.10) by tye (Sage) on Jul 20, 2009 at 20:09 UTC
pack and unpack have always been all about how the bytes are laid out in memory. Trying to pretend now that they don't just leads to insane results.

What useful value does `pack "U0V", $int` have? What does it even represent? Are there processors that implement integers as "four codepoints, encoded in UTF-8"? Would it ever be useful to send that over a socket?

How many hex digits does `unpack "H", pack "U", $codepoint` return? UTF-8 defines that quite clearly given the long-standing definition of unpack as dealing with bytes. But in your world view, it should return some variable number of hex digits that has no clear definition already laid out for it. Could it return 3 if that is "enough"? Perhaps it should only return 1 for control characters? Or should it return an even number until all of the bits are taken care of? It is your fantasy world and I have little clue what would "make sense" in such a strange place. What Perl 5.10 does certainly doesn't seem to make much sense (and disagrees with the documentation and is fairly useless).

Now, how many characters does `unpack "B", pack "U", $codepoint` return?

What does `pack "B16", $bits` produce when some Unicode bit sneaks into the equation without me noticing? Does it change from generating two bytes to generating two characters, each encoded in UTF-8? Does it generate a 16-bit codepoint that is then encoded into UTF-8? Whatever gets decided, good luck explaining the answer as part of the already-way-too-confusing documentation for pack.

Notice that "U" was actually defined exactly in accordance with my view. It produces the same bytes, even when you add in the crazy "unicode vs. bytes mode" stuff of 5.10. Because pack() has always been about packing bytes into interesting shapes.

Yes, concatenating the output of pack with a UTF-8 string in Perl breaks things. Pretending it doesn't just hides the fact that your data is no longer packed the way that you specified it should be.

And then you follow the documentation (and over a decade of precedent) and use `unpack "C"` to verify that your octets are exactly as they are supposed to be and Perl 5.10 lies to you. Perl 5.10 "fixed" something by making the breakage harder to notice. That is no improvement. And it leads to a model for what pack/unpack do that is so confused that it will be tons harder for people to wrap their heads around (and wrapping your head around pack/unpack was already plenty hard). Clearly, the authors of this new paradigm haven't even wrapped their heads around what they dreamed up yet, given the mishmash of half-done changes in the behavior of lots of un/pack templates in 5.10.

- tye
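One way to sidestep the ambiguity described above is to encode to octets explicitly before unpacking. The following is only a sketch, not something proposed in the thread: it assumes the core `Encode` module, and `0x1234` and the bit string are arbitrary example values. Once the string holds plain octets, `H*` and `B*` have their byte-oriented meaning regardless of Perl version.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical example value; any code point would do.
my $codepoint = 0x1234;

# Encode explicitly so the string holds UTF-8 octets, one octet per character.
my $octets = encode('UTF-8', chr($codepoint));

# With plain octets, 'H*' and 'B*' work on bytes, two hex digits or eight bits each.
my $hex  = unpack 'H*', $octets;
my $bits = unpack 'B*', $octets;

# 'B16' given sixteen bit characters packs exactly two bytes.
my $packed = pack 'B16', '0100000101000010';

print "hex:    $hex\n";
print "bits:   $bits\n";
printf "packed: %s (%d bytes)\n", $packed, length($packed);
```

U+1234 encodes to three UTF-8 octets, so `$hex` holds six hex digits and `$bits` holds twenty-four bit characters.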
Re^10: Parsing UTF-16LE CSV Records Using Text::CSV* (5.10) by ikegami (Patriarch) on Jul 20, 2009 at 20:34 UTC
> But in your world view, [`unpack "H", pack "U", $codepoint`] should return some variable number of hex digits that has no clear definition already laid out for it.

`pack` producing something that isn't bytes? Unpacking something that (potentially) isn't bytes? In my world view, the construct doesn't make much sense.

I don't have any experience with "U", which is why I didn't comment on it initially. I'm not in a good position to judge what the problems with the new or old methods are, since I don't even know what kind of problems it solves. I'll read your post when I have more time to absorb it.
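For anyone else unfamiliar with "U", a minimal sketch of its basic behavior may help. The code points used are arbitrary example values. A template that begins with `U` makes `pack` return a character string rather than a string of octets, and `unpack 'U*'` returns the code points again.

```perl
use strict;
use warnings;

# Hypothetical example value: U+263A.
my $char = pack 'U', 0x263A;

# The result is a one-character string, the same as chr(0x263A).
my $same_as_chr = $char eq chr(0x263A) ? 'yes' : 'no';

# 'U*' reverses the operation and returns the code points.
my @codepoints = unpack 'U*', "\x{263A}\x{E9}";

printf "length: %d, same as chr: %s\n", length($char), $same_as_chr;
printf "code points: %s\n", join ' ', map { sprintf 'U+%04X', $_ } @codepoints;
```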