Re^2: Parsing UTF-16LE CSV Records Using Text::CSV*

Re^3: Parsing UTF-16LE CSV Records Using Text::CSV* by graff (Chancellor) on Jul 20, 2009 at 13:52 UTC
Boo for equating the string's internal encoding with whether it's been decoded or not. If you "decode()" a non-ASCII, non-utf8 string (or if it passes through a decoding IO layer on input), and the operation is successful, the string value returned by decode() has the utf8 flag on, and you get character semantics (not byte semantics) when doing stuff with that string -- that's the point of using "decode()" and the encoding IO layer, and that's all I was talking about in my suggestion. (My reply may well have been less than fully helpful for other reasons.) As for U+00FE, perhaps I'm just behind the times, not having taken time to explore all the details of 5.10 yet. In 5.8.8, the "perl-internal utf8" storage of characters in the range 0x80-0xFF is single-byte. They would be converted to multi-byte on output to a utf8-mode file handle. I don't recall at the moment what particular operations are sensitive to (or would reveal) this distinction, but it's there.
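For readers following along, here is a minimal sketch of the two decoding routes mentioned above: calling `decode()` directly and reading through an `:encoding(...)` IO layer. The sample data and the helper name `slurp_decoded` are made up for illustration, and an in-memory filehandle stands in for a real UTF-16LE file.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample: the text "a,\x{FE}\n" as UTF-16LE bytes.
my $bytes = encode('UTF-16LE', "a,\x{FE}\n");

# Route 1: decode the byte string explicitly.
my $via_decode = decode('UTF-16LE', $bytes);

# Route 2: let an IO layer decode while reading.
# slurp_decoded is an illustrative helper, not part of any module.
sub slurp_decoded {
    my ($raw, $encoding) = @_;
    open my $fh, "<:encoding($encoding)", \$raw
        or die "Cannot open in-memory handle: $!";
    local $/;    # slurp mode: read everything in one go
    my $text = <$fh>;
    close $fh;
    return $text;
}
my $via_layer = slurp_decoded($bytes, 'UTF-16LE');

printf "same text: %s\n", ($via_decode eq $via_layer ? 'yes' : 'no');
printf "characters: %d, input bytes: %d\n",
    length($via_decode), length($bytes);
```

Either route should hand back the same four characters from the eight input bytes; which one is more convenient depends on whether the data arrives as a file or as a string already in memory.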
Re^4: Parsing UTF-16LE CSV Records Using Text::CSV* by ikegami (Patriarch) on Jul 20, 2009 at 16:21 UTC
> and you get character semantics (not byte semantics) when doing stuff with that string

There's no such thing. If an operation behaves differently depending on the internal encoding of the string, it's a bug. These are being fixed, e.g. `pack` was fixed in 5.10.0. Regex matches and others are being fixed for 5.12. Text::CSV_XS was fixed in 0.46.

> that's the point of using "decode()" and the encoding IO layer

Not at all. The point of `decode` is to decode characters. It has nothing to do with the internal storage of strings. You can have decoded characters with the utf8 flag off. You can have encoded characters with the utf8 flag on. If you need to play with the internal encoding, `utf8::upgrade` and `utf8::downgrade` are the appropriate tools. This is what the previously linked document shows.

> the "perl-internal utf8" storage of characters in the range 0x80-0xFF is single-byte.

Impossible. The high bit indicates the presence of a multiple byte char.

`$ perl -MEncode -le'print length encode "utf8", decode "UTF-16le", "\xFE\x00"'`
`2`

or

`$ perl -MDevel::Peek -MEncode -le'Dump decode "UTF-16le", "\xFE\x00"'`
`...`
`PV = 0x8172e78 "\303\276"\0 [UTF8 "\x{fe}"]`
`CUR = 2`
`...`

U+000000-U+00007F: One byte
U+000080-U+0007FF: Two bytes
U+000800-U+00FFFF: Three bytes
U+010000-U+10FFFF: Four bytes
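A small sketch of the distinction drawn above, assuming only core Perl: `utf8::upgrade` changes how a string is stored, not which characters it holds. The helper `utf8_len` is a hypothetical name introduced here to reproduce the byte counts for a few sample code points.

```perl
use strict;
use warnings;

# One character, U+00FE, initially stored one byte per character.
my $down = "\xFE";
my $up   = $down;
utf8::upgrade($up);    # switch the internal storage to UTF-8

# Same string as far as Perl-level operations are concerned.
printf "equal: %s\n", ($down eq $up ? 'yes' : 'no');
printf "length: %d vs %d\n", length($down), length($up);
printf "utf8 flag: %s vs %s\n",
    (utf8::is_utf8($down) ? 'on' : 'off'),
    (utf8::is_utf8($up)   ? 'on' : 'off');

# Hypothetical helper: number of bytes a code point takes in UTF-8.
sub utf8_len {
    my ($code_point) = @_;
    my $s = chr $code_point;
    utf8::encode($s);    # replace the characters with their UTF-8 bytes
    return length $s;
}

printf "U+%04X: %d byte(s)\n", $_, utf8_len($_)
    for 0x41, 0xFE, 0x20AC, 0x1F600;
```

The two copies compare equal and both have length 1, even though only one of them carries the utf8 flag.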
Re^5: Parsing UTF-16LE CSV Records Using Text::CSV* (5.10) by tye (Sage) on Jul 20, 2009 at 18:42 UTC
pack was badly broken in 5.10 (and silently changes existing behavior, which should almost never be done). The 5.10 documentation for "pack" says that "c" and "C" are "eight bit" and "octet", but they no longer (always) are. So the documentation is wrong (but the documented behavior is preferable, especially since it has always worked that way). The results of `unpack "H*", pack "U", 0x1234` under 5.10 are mostly nonsensical, not "fixed". Pretending that the encoding of the string should never make a difference is just fooling yourself and leads to confusing magical behaviors that sometimes "do what you mean" but burn you when they don't.
- tye
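One way to avoid depending on how `pack "U"` treats the string's storage is to ask for the bytes explicitly. The following is a sketch of that approach, not a statement about what `pack` should do; the helper name `utf8_hex` is invented for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: hex dump of the UTF-8 encoding of one code point.
sub utf8_hex {
    my ($code_point) = @_;
    # encode() returns a byte string, so "H*" sees plain octets.
    my $bytes = encode('UTF-8', chr $code_point);
    return unpack 'H*', $bytes;
}

print utf8_hex(0x1234), "\n";    # e188b4
```

Because `encode` always yields octets, the result here does not depend on the internal representation of the character string passed in.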
Re^6: Parsing UTF-16LE CSV Records Using Text::CSV* (5.10) by ikegami (Patriarch) on Jul 20, 2009 at 19:06 UTC
Re^7: Parsing UTF-16LE CSV Records Using Text::CSV* (5.10) by tye (Sage) on Jul 20, 2009 at 19:16 UTC
Some notes below your chosen depth have not been shown here