Re^5: Parsing UTF-16LE CSV Records Using Text::CSV* (5.10)

Re^6: Parsing UTF-16LE CSV Records Using Text::CSV* (5.10) by ikegami (Patriarch) on Jul 20, 2009 at 19:06 UTC
The 5.10 documentation for "pack" says that "c" and "C" are "eight bit" and "octet", but they no longer (always) are. So it works for values that have more than eight bits too. I'm not convinced that's a problem. It could die or warn, but it's more useful if it doesn't. At least it works for all eight-bit values. It didn't before 5.10.
`use strict; use warnings; use Test::More tests => 4; my $ch = chr(0xA0); utf8::downgrade( my $dn_ch = $ch ); utf8::upgrade( my $up_ch = $ch ); is(unpack('c', $dn_ch), 0xA0-0x100, 'c, internal format 0'); is(unpack('c', $up_ch), 0xA0-0x100, 'c, internal format 1'); is(unpack('C', $dn_ch), 0xA0, 'C, internal format 0'); is(unpack('C', $up_ch), 0xA0, 'C, internal format 1');`
5.8.8: `1..4 ok 1 - c, internal format 0 not ok 2 - c, internal format 1 # Failed test 'c, internal format 1' # at 781731.pl line 12. # got: '-62' # expected: '-96' ok 3 - C, internal format 0 not ok 4 - C, internal format 1 # Failed test 'C, internal format 1' # at 781731.pl line 14. # got: '194' # expected: '160' # Looks like you failed 2 tests of 4.`
5.10.0: `1..4 ok 1 - c, internal format 0 ok 2 - c, internal format 1 ok 3 - C, internal format 0 ok 4 - C, internal format 1`
Re^7: Parsing UTF-16LE CSV Records Using Text::CSV* (5.10) by tye (Sage) on Jul 20, 2009 at 19:16 UTC
"C" doesn't unpack 8-bit code points. If you want code points, use "U" or ord. It unpacks pre-Unicode characters, and the "eight-bit" and "octet" comments are meant to make that clear.
It is an accident of history that 'b' was taken by "binary" (base 2) rather than "byte", and so "c" got used for "byte" (via the mnemonic "char"). That doesn't mean "c" should be changed to mean "character, perhaps multi-byte", especially given the nature of pack and unpack.
pack and unpack have always been about how the bytes are encoded into memory. Trying to change them into pretending that they don't care about that is a huge mistake.
Update: "It doesn't hurt anything." What part of "silent change" don't you understand?
- tye
Re^8: Parsing UTF-16LE CSV Records Using Text::CSV* (5.10) by ikegami (Patriarch) on Jul 20, 2009 at 19:34 UTC
As best as I can tell, a code point is an index into a character set. I don't see how that relates [unless you're saying something is a code point if it's internally encoded using one encoding (UTF-8), but somehow it's not if it's internally encoded using another (iso-latin-1)]. All I have is a packed byte with no association to any character set. `unpack 'C', substr("\xA0$s", 0, 1)` doesn't give `0xA0` in 5.8.8, and that's a bug.
"have always been about how the bytes are encoded into memory"
Of course. You're saying it also matters how those encoded bytes are encoded, and I disagree with that.
"`unpack "H", pack "U", 0x1234` results under 5.10 are mostly nonsensical, not "fixed".*"
I forgot to address this earlier. I can see a case for `unpack 'H*', $characters_higher_than_255` returning something more sensible. Same idea as allowing characters above 255 for 'C': It doesn't hurt anything. It even aids backwards compatibility.
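For anyone who wants to try the `substr("\xA0$s", 0, 1)` case from the post above, here is a minimal sketch. The post never says what `$s` holds, so the value used here (a single character above 255) is an assumption, chosen only because it forces Perl to store the concatenated string in its internal UTF-8 format. The variable names `$first` and `$plain` are likewise made up for illustration.

```perl
use strict;
use warnings;

# Assumption: $s contains a character above 255, so it is stored in
# Perl's internal UTF-8 format. The original post does not define $s.
my $s = "\x{1234}";

# Concatenation upgrades the whole string, including the \xA0,
# and substr keeps that internal format for the piece it returns.
my $first = substr("\xA0$s", 0, 1);

# The same one-character string, still in the byte (downgraded) format.
my $plain = "\xA0";

# Report the internal format flag, the ordinal, and what unpack 'C' sees.
printf "upgraded:   is_utf8=%d  ord=0x%02X  unpack C=0x%02X\n",
    utf8::is_utf8($first) ? 1 : 0, ord($first), unpack('C', $first);
printf "downgraded: is_utf8=%d  ord=0x%02X  unpack C=0x%02X\n",
    utf8::is_utf8($plain) ? 1 : 0, ord($plain), unpack('C', $plain);
```

The two strings compare equal with `eq` and differ only in the `is_utf8` flag. On 5.10 and later, `unpack 'C'` should agree with `ord` on both lines. Going by the test output earlier in the thread, 5.8.8 instead returns 194 (the first byte of the internal UTF-8 encoding) for the upgraded string.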
Re^9: Parsing UTF-16LE CSV Records Using Text::CSV* (5.10) by tye (Sage) on Jul 20, 2009 at 20:09 UTC
Re^10: Parsing UTF-16LE CSV Records Using Text::CSV* (5.10) by ikegami (Patriarch) on Jul 20, 2009 at 20:34 UTC