in reply to Re^3: Parsing UTF-16LE CSV Records Using Text::CSV*

and you get character semantics (not byte semantics) when doing stuff with that string

There's no such thing. If an operation behaves differently depending on the internal encoding of the string, it's a bug. These are being fixed. For example, pack was fixed in 5.10.0. Regex matching and other operations are being fixed for 5.12. Text::CSV_XS was fixed in 0.46.

that's the point of using "decode()" and the encoding IO layer

Not at all. The point of decode is to decode characters. It has nothing to do with the internal storage of strings.

You can have decoded characters with the utf8 flag off.
You can have encoded characters with the utf8 flag on.

If you need to play with the internal encoding, utf8::upgrade and utf8::downgrade are the appropriate tools.
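
A minimal sketch of the two statements above. The variable names are illustrative, not from the thread, and it assumes no locale or encoding pragma is in effect, so that chr(0xE9) yields a string stored with the utf8 flag off.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Decoded text: one character, U+00E9, stored with the utf8 flag off.
my $char      = chr(0xE9);
my $char_flag = utf8::is_utf8($char) ? 1 : 0;

# Encoded text: the two UTF-8 bytes for U+00E9.
my $bytes = encode("UTF-8", $char);

# Change only the internal storage. The flag is now on, but the
# string still holds the same two encoded bytes.
utf8::upgrade($bytes);
my $bytes_flag = utf8::is_utf8($bytes) ? 1 : 0;

printf "decoded: flag=%d length=%d\n", $char_flag,  length $char;
printf "encoded: flag=%d length=%d\n", $bytes_flag, length $bytes;

# Decoding is unaffected by the storage change.
print "round trip ok\n" if decode("UTF-8", $bytes) eq $char;
```

The flag describes how perl stores the string, not whether the contents are decoded text or encoded bytes.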

This is what the previously linked document shows.

the "perl-internal utf8" storage of characters in the range 0x80-0xFF is single-byte.

Impossible. The high bit indicates the presence of a multiple byte char.

$ perl -MEncode -le'print length encode "utf8", decode "UTF-16le", "\xFE\x00"'
2

or

$ perl -MDevel::Peek -MEncode -le'Dump decode "UTF-16le", "\xFE\x00"'
...
  PV = 0x8172e78 "\303\276"\0 [UTF8 "\x{fe}"]
  CUR = 2
...

U+000000-U+00007F: One byte
U+000080-U+0007FF: Two bytes
U+000800-U+00FFFF: Three bytes
U+010000-U+10FFFF: Four bytes
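
The ranges above can be checked with a short sketch. The helper name utf8_byte_length is hypothetical, and the sample code points are arbitrary picks from each range.

```perl
use strict;
use warnings;

# Return how many bytes the UTF-8 encoding of a code point occupies.
sub utf8_byte_length {
    my ($code_point) = @_;
    my $str = chr $code_point;
    utf8::encode($str);    # in place: characters -> UTF-8 bytes
    return length $str;
}

for my $cp (0x41, 0x7F, 0x80, 0xFE, 0x7FF, 0x800, 0x20AC, 0x10000, 0x1F600) {
    printf "U+%06X: %d byte(s)\n", $cp, utf8_byte_length($cp);
}
```

Note that 0x80 and 0xFE already need two bytes, which is the point being made about the 0x80-0xFF range.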

Re^5: Parsing UTF-16LE CSV Records Using Text::CSV* (5.10)
by tye (Sage) on Jul 20, 2009 at 18:42 UTC

    pack was badly broken in 5.10 (and it silently changed existing behavior, which should almost never be done). The 5.10 documentation for "pack" says that "c" and "C" are "eight bit" and "octet", but they no longer (always) are. So the documentation is wrong (but the documented behavior is preferable, especially since it has always worked that way). The results of unpack "H*", pack "U", 0x1234 under 5.10 are mostly nonsensical, not "fixed".

    Pretending that the encoding of the string should never make a difference is just fooling yourself and leads to confusing magical behaviors that sometimes "do what you mean" but burn you when they don't.

    - tye        

      The 5.10 documentation for "pack" says the "c" and "C" are "eight bit" and "octet" but they no longer (always) are.

      So it works for values that have more than eight bits too. I'm not convinced that's a problem. It could die or warn, but it's more useful if it doesn't.

      At least it works for all eight-bit values. It didn't before 5.10.

      use strict;
      use warnings;

      use Test::More tests => 4;

      my $ch = chr(0xA0);

      utf8::downgrade( my $dn_ch = $ch );
      utf8::upgrade( my $up_ch = $ch );

      is(unpack('c', $dn_ch), 0xA0-0x100, 'c, internal format 0');
      is(unpack('c', $up_ch), 0xA0-0x100, 'c, internal format 1');
      is(unpack('C', $dn_ch), 0xA0, 'C, internal format 0');
      is(unpack('C', $up_ch), 0xA0, 'C, internal format 1');

      5.8.8:

      1..4
      ok 1 - c, internal format 0
      not ok 2 - c, internal format 1
      #   Failed test 'c, internal format 1'
      #   at 781731.pl line 12.
      #          got: '-62'
      #     expected: '-96'
      ok 3 - C, internal format 0
      not ok 4 - C, internal format 1
      #   Failed test 'C, internal format 1'
      #   at 781731.pl line 14.
      #          got: '194'
      #     expected: '160'
      # Looks like you failed 2 tests of 4.

      5.10.0:

      1..4
      ok 1 - c, internal format 0
      ok 2 - c, internal format 1
      ok 3 - C, internal format 0
      ok 4 - C, internal format 1

        "C" doesn't unpack 8-bit code points. If you want codepoints, use "U" or ord. It unpacks pre-Unicode characters and the "eight-bit" and "octet" comments are meant to make that clear. It is an accident of history that 'b' was used by "binary" (base 2) not "byte" and so "c" got used for "byte" (via the mnemonic "char").

        That doesn't mean "c" should be changed to mean "character, perhaps multi-byte", especially given the nature of pack and unpack. pack and unpack have always been about how the bytes are encoded into memory. Trying to change them into pretending that they don't care about that is a huge mistake.
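
The distinction this sub-thread keeps returning to, code points versus the bytes perl actually stores, can be seen in a small sketch. It reuses the $dn_ch and $up_ch setup from the test script earlier in the thread; using bytes::length to expose the stored byte count is my choice for illustration, not something either poster proposed.

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length without enabling byte semantics

my $ch = chr(0xA0);
utf8::downgrade( my $dn_ch = $ch );    # internal format 0: one stored byte
utf8::upgrade(   my $up_ch = $ch );    # internal format 1: two stored bytes

for my $s ($dn_ch, $up_ch) {
    # ord and length report the character; bytes::length reports storage.
    printf "ord=0x%02X  length=%d  stored bytes=%d\n",
        ord $s, length $s, bytes::length($s);
}
```

Here ord and length agree for both strings, while the stored byte count differs. Whether "c" and "C" should follow the first view or the second is exactly what the two posters disagree about.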

        Update:

        It doesn't hurt anything.

        What part of "silent change" don't you understand?

        - tye