in reply to CSV nightmare

open(my $data, '<:encoding(UCS-2le)', $file)

Actually, due to problems with the placement of the :crlf layer, it should be:

open(my $data, '<:raw:encoding(UCS-2le):crlf:utf8', $file)

And the funny char is the BOM (U+FEFF).

read($data, my $bom='', 1); # Discard BOM.

(To the best of my knowledge, what Microsoft uses is not really UTF-16 but UCS-2le.)
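Putting the pieces together, here's a minimal sketch of reading such a file; the sub name and the field handling are mine, and it assumes the file really is UCS-2LE with a leading BOM and CRLF line endings, as described above:

```perl
use strict;
use warnings;

# Hedged sketch: read an Excel-style "Unicode" CSV export that is
# UCS-2LE with a leading BOM and CRLF line endings.
sub read_ucs2le_lines {
    my ($file) = @_;

    # :raw first, so the :crlf layer sits above the decoding layer
    # (the placement problem mentioned above).
    open(my $data, '<:raw:encoding(UCS-2le):crlf:utf8', $file)
        or die "Can't open $file: $!";

    read($data, my $bom = '', 1);   # discard the BOM (U+FEFF)

    my @lines = <$data>;
    chomp @lines;                   # :crlf already turned CRLF into \n
    close $data;
    return @lines;
}
```

From here, each element of the returned list is a decoded text line, ready for whatever CSV parsing you use.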

Re^2: CSV nightmare
by graff (Chancellor) on Jun 03, 2008 at 02:49 UTC
    (To the best of my knowledge, what Microsoft uses is not really UTF-16 but UCS-2le.)

    I used to think that UCS-2LE was a synonym for UTF-16LE (likewise with BE instead of LE). But I then found this bit in a FAQ at unicode.org:

    Q: What is the difference between UCS-2 and UTF-16?

    A: UCS-2 is what a Unicode implementation was up to Unicode 1.1, before surrogate code points and UTF-16 were added as concepts to Version 2.0 of the standard. This term should now be avoided.

    When interpreting what people have meant by "UCS-2" in past usage, it is best thought of as not a data format, but as an indication that an implementation does not interpret any supplementary characters. In particular, for the purposes of data exchange, UCS-2 and UTF-16 are identical formats. Both are 16-bit, and have exactly the same code unit representation.

    The effective difference between UCS-2 and UTF-16 lies at a different level, when one is interpreting a sequence of code units as code points or as characters. In that case, a UCS-2 implementation would not handle processing like character properties, codepoint boundaries, collation, etc. for supplementary characters.

    The current version of Unicode is 5.something, so there's not much point in using the "UCS-2" terminology these days.

    And yes, Microsoft tends to favor the LE byte order, especially for text data (MS-Word, and "plain-text" exports from MS-Office products). But cells in Excel spreadsheets are, for some reason, stored as BE.
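The LE/BE distinction is easy to see with the Encode module (this little demonstration is mine, not part of the original post): the same character produces its bytes in opposite orders.

```perl
use Encode qw(encode);

# 'A' is U+0041; the two byte orders put the 0x41 in opposite halves
# of the 16-bit code unit.
my $le = encode('UTF-16LE', 'A');   # bytes: 0x41 0x00
my $be = encode('UTF-16BE', 'A');   # bytes: 0x00 0x41

printf "LE: %vX\n", $le;
printf "BE: %vX\n", $be;
```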

    As far as perl encoding layers are concerned, UTF-16 (with no byte-order spec) tends to mean: for output, byte order is apparently BE by default; for input, byte order is determined by a stream-initial BOM (if the BOM isn't there, perl complains about it; if it is there, perl will remove it for you).

    (updated last paragraph to reflect ikegami's corrections -- thanks, ike)

      To me, there are two important differences between UCS-2 and UTF-16.

      The first important difference is that UCS-2 can only represent U+0000 to U+FFFF, whereas UTF-16 can represent any UNICODE character.

      The second important difference is the number of bytes UCS-2 and UTF-16 use to store a character. Each UCS-2 character is exactly 16 bits in size, whereas UTF-16 is variable-width like UTF-8: some characters require more than one 16-bit code unit.
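The variable width is easy to demonstrate with Encode (my illustration; the particular characters are arbitrary): a BMP character takes one 16-bit code unit, a supplementary character takes a surrogate pair.

```perl
use Encode qw(encode);

# BMP character: one 16-bit code unit, i.e. 2 bytes.
my $bmp  = encode('UTF-16BE', "\x{00E9}");    # U+00E9 LATIN SMALL LETTER E WITH ACUTE

# Supplementary character: a surrogate pair, i.e. 4 bytes, in UTF-16;
# strict UCS-2 cannot represent it at all.
my $supp = encode('UTF-16BE', "\x{1D11E}");   # U+1D11E MUSICAL SYMBOL G CLEF

print length($bmp), "\n";    # 2
print length($supp), "\n";   # 4
```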

      for output, byte order is determined by the cpu

      No. I'm on an x86 (LE machine), but UTF-16be was used.

      for input, byte order is determined by a stream-initial BOM (if the BOM isn't there, perl complains about it; if it is there, perl does not remove it for you).

      No. Perl *does* remove it for you, just like it adds it for you for output.
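A quick round-trip sketch of that behavior (the filename is mine; this assumes the plain :encoding(UTF-16) layer, no LE/BE suffix): perl writes the BOM on output and consumes it again on input, so no stray U+FEFF shows up in the data.

```perl
use strict;
use warnings;

my $file = "utf16_demo_$$.txt";

# Writing through :encoding(UTF-16) prepends a BOM automatically.
open(my $out, '>:raw:encoding(UTF-16)', $file) or die $!;
print $out "hello\n";
close $out;

# Reading back, the layer consumes the BOM and picks the byte order from it.
open(my $in, '<:raw:encoding(UTF-16)', $file) or die $!;
my $line = <$in>;
close $in;
unlink $file;

print $line;   # just "hello\n" -- no U+FEFF at the front
```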