in reply to Re^4: Parsing UTF-16LE CSV Records Using Text::CSV*
in thread Parsing UTF-16LE CSV Records Using Text::CSV*

pack was badly broken in 5.10 (and silently changed existing behavior, which should almost never be done). The 5.10 documentation for "pack" says the "c" and "C" are "eight bit" and "octet" but they no longer (always) are. So the documentation is wrong (but the documented behavior is preferable, especially since it has always worked that way). unpack "H*", pack "U", 0x1234 results under 5.10 are mostly nonsensical, not "fixed".

Pretending that the encoding of the string should never make a difference is just fooling yourself and leads to confusing magical behaviors that sometimes "do what you mean" but burn you when they don't.

- tye        

Re^6: Parsing UTF-16LE CSV Records Using Text::CSV* (5.10)
by ikegami (Patriarch) on Jul 20, 2009 at 19:06 UTC

    The 5.10 documentation for "pack" says the "c" and "C" are "eight bit" and "octet" but they no longer (always) are.

    So it works for values that have more than eight bits too. I'm not convinced that's a problem. It could die or warn, but it's more useful if it doesn't.

    At least it works for all eight-bit values. It didn't before 5.10.

    use strict;
    use warnings;

    use Test::More tests => 4;

    my $ch = chr(0xA0);

    utf8::downgrade( my $dn_ch = $ch );
    utf8::upgrade( my $up_ch = $ch );

    is(unpack('c', $dn_ch), 0xA0-0x100, 'c, internal format 0');
    is(unpack('c', $up_ch), 0xA0-0x100, 'c, internal format 1');
    is(unpack('C', $dn_ch), 0xA0, 'C, internal format 0');
    is(unpack('C', $up_ch), 0xA0, 'C, internal format 1');

    5.8.8:

    1..4
    ok 1 - c, internal format 0
    not ok 2 - c, internal format 1
    # Failed test 'c, internal format 1'
    # at 781731.pl line 12.
    # got: '-62'
    # expected: '-96'
    ok 3 - C, internal format 0
    not ok 4 - C, internal format 1
    # Failed test 'C, internal format 1'
    # at 781731.pl line 14.
    # got: '194'
    # expected: '160'
    # Looks like you failed 2 tests of 4.

    5.10.0:

    1..4
    ok 1 - c, internal format 0
    ok 2 - c, internal format 1
    ok 3 - C, internal format 0
    ok 4 - C, internal format 1

      "C" doesn't unpack 8-bit code points. If you want code points, use "U" or ord. It unpacks pre-Unicode characters, and the "eight-bit" and "octet" comments are meant to make that clear. It is an accident of history that "b" was used for "binary" (base 2), not "byte", and so "c" got used for "byte" (via the mnemonic "char").

      That doesn't mean "c" should be changed to mean "character, perhaps multi-byte", especially given the nature of pack and unpack. pack and unpack have always been about how the bytes are encoded into memory. Trying to change them into pretending that they don't care about that is a huge mistake.
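One way to sidestep the disagreement is to stop relying on the internal format at all. The sketch below is only an illustration: the variable names and the choice of the Encode module are assumptions, not something proposed in this thread. It reads the code point with ord, and picks the byte encoding explicitly before handing the string to unpack.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character, code point 0xA0.
my $ch = chr(0xA0);

# Two copies that differ only in Perl's internal storage format.
utf8::downgrade( my $dn_ch = $ch );
utf8::upgrade(   my $up_ch = $ch );

# ord() reports the code point regardless of the internal format.
my @code_points = ( ord($dn_ch), ord($up_ch) );

# encode() returns a plain byte string in the requested encoding, so the
# octets unpack sees no longer depend on how the character was stored.
my @latin1_octets = unpack 'C*', encode('ISO-8859-1', $up_ch);
my @utf8_octets   = unpack 'C*', encode('UTF-8',      $dn_ch);

printf "code points:   %s\n", join ' ', map { sprintf '0x%02X', $_ } @code_points;
printf "latin-1 bytes: %s\n", join ' ', map { sprintf '0x%02X', $_ } @latin1_octets;
printf "UTF-8 bytes:   %s\n", join ' ', map { sprintf '0x%02X', $_ } @utf8_octets;
```

Written this way, the script says which bytes it wants, and the answer should not change between 5.8 and 5.10.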

      Update:

      It doesn't hurt anything.

      What part of "silent change" don't you understand?

      - tye        

        As best as I can tell, a code point is an index into a character set. I don't see how that relates [unless you're saying something is a code point if it's internally encoded using one encoding (UTF-8), but somehow it's not if it's internally encoded using another (iso-latin-1)]. All I have is a packed byte with no association to any character set.

        unpack 'C', substr("\xA0$s", 0, 1) doesn't give 0xA0 in 5.8.8, and that's a bug.
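A minimal sketch of that case follows. The value of $s is a hypothetical stand-in, since the thread never says what it holds; anything with a character above 0xFF forces the concatenated string into the upgraded internal format. The last step, going through Encode first, is an assumption about how one might make the result independent of the Perl version, not something suggested above.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical content for $s: any text with a character above 0xFF.
my $s = "\x{263A}";

# Interpolating $s upgrades the whole string, including the leading \xA0.
my $joined      = "\xA0$s";
my $is_upgraded = utf8::is_utf8($joined) ? 1 : 0;

# Still exactly one character with code point 0xA0.
my $first      = substr($joined, 0, 1);
my $code_point = ord($first);

# What unpack 'C' returns here is the point under debate: the test run
# earlier in the thread got 0xC2 for an upgraded string on 5.8.8 and
# 0xA0 on 5.10.0.
my $unpacked = unpack 'C', $first;

# Encoding explicitly first avoids depending on the internal format.
my $explicit = unpack 'C', encode('ISO-8859-1', $first);

printf "upgraded=%d ord=0x%02X unpack=0x%02X explicit=0x%02X\n",
    $is_upgraded, $code_point, $unpacked, $explicit;
```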

        have always been about how the bytes are encoded into memory

        Of course. You're saying it also matters how those encoded bytes are encoded, and I disagree with that.

        unpack "H*", pack "U", 0x1234 results under 5.10 are mostly nonsensical, not "fixed".

        I forgot to address this earlier.

        I can see a case for unpack 'H*', $characters_higher_than_255 returning something more sensible. Same idea as allowing characters above 255 for 'C': It doesn't hurt anything. It even aids backwards compatibility.
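For comparison, here is a sketch of asking for the hex dump of a specific encoding rather than of whatever the string happens to look like internally. The use of Encode and the three encodings chosen are assumptions for illustration; UTF-16LE is included only because it is the encoding this thread started with.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character, U+1234, built the same way as in the disputed example.
my $ch = pack 'U', 0x1234;

# Hex dump of the character under each explicitly chosen encoding.
my %hex;
for my $encoding (qw(UTF-8 UTF-16BE UTF-16LE)) {
    $hex{$encoding} = unpack 'H*', encode($encoding, $ch);
}

printf "%-8s %s\n", $_, $hex{$_} for sort keys %hex;
# UTF-16BE 1234
# UTF-16LE 3412
# UTF-8    e188b4
```

Each of these is well defined on its own terms, which is not obviously true of unpack 'H*' applied directly to a string holding characters above 255.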