(To the best of my knowledge, what Microsoft uses is not really UTF-16 but UCS-2le.)

I used to think that UCS-2LE was a synonym for UTF-16LE (likewise with BE instead of LE). But I then found this bit in a FAQ at unicode.org:

Q: What is the difference between UCS-2 and UTF-16?

A: UCS-2 is what a Unicode implementation was up to Unicode 1.1, before surrogate code points and UTF-16 were added as concepts to Version 2.0 of the standard. This term should now be avoided.

When interpreting what people have meant by "UCS-2" in past usage, it is best thought of as not a data format, but as an indication that an implementation does not interpret any supplementary characters. In particular, for the purposes of data exchange, UCS-2 and UTF-16 are identical formats. Both are 16-bit, and have exactly the same code unit representation.

The effective difference between UCS-2 and UTF-16 lies at a different level, when one is interpreting a sequence of code units as code points or as characters. In that case, a UCS-2 implementation would not handle processing like character properties, codepoint boundaries, collation, etc. for supplementary characters.

The current version of Unicode is 5.something, so there's not much point in using the "UCS-2" terminology these days.
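
To make the FAQ's point concrete, here is a minimal sketch, assuming the core Encode module; the helper name utf16_units is made up for illustration. It lists the 16-bit code units that UTF-16LE produces for one BMP character and for one supplementary character:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the 16-bit code units of a string's
# UTF-16LE encoding ('v' unpacks little-endian unsigned 16-bit values).
sub utf16_units {
    my ($str) = @_;
    return unpack 'v*', encode('UTF-16LE', $str);
}

my @bmp  = utf16_units("\x{263A}");     # one character inside the BMP
my @supp = utf16_units("\x{1F600}");    # one supplementary character

printf "U+263A:  %s\n", join ' ', map { sprintf '%04X', $_ } @bmp;
printf "U+1F600: %s\n", join ' ', map { sprintf '%04X', $_ } @supp;
```

The first line shows a single code unit (263A); the second shows the surrogate pair D83D DE00. The bytes are the same whatever you call the format; an implementation that only understands "UCS-2" would simply treat that pair as two separate 16-bit values instead of one character.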

And yes, Microsoft tends to favor the LE byte order, especially for text data (MS-Word, and "plain-text" exports from MS-Office products). But cells in Excel spreadsheets are, for some reason, stored as BE.

As far as perl encoding layers are concerned, UTF-16 (with no byte-order spec) tends to mean: for output, byte order is apparently BE by default; for input, byte order is determined by a stream-initial BOM (if the BOM isn't there, perl complains about it; if it is there, perl will remove it for you).
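
A quick way to check this for yourself is sketched below. It calls Encode's encode() and decode() directly rather than going through a file handle layer, on the assumption that the :encoding(UTF-16) layer relies on the same codec; the sample strings are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Output: no byte order given, so Encode writes a BOM followed by
# big-endian data.
my $out = encode('UTF-16', 'A');
printf "encoded bytes: %s\n", unpack('H*', $out);

# Input starting with a little-endian BOM: the BOM selects the byte
# order and is not part of the decoded string.
my $le_with_bom = "\xFF\xFE\x41\x00";
my $text = decode('UTF-16', $le_with_bom);
printf "decoded: %s (length %d)\n", $text, length $text;

# Input without a BOM: decode() dies, so trap the error with eval.
my $le_no_bom = "\x41\x00";
my $ok = eval { decode('UTF-16', $le_no_bom); 1 };
print "without BOM: ", ($ok ? "decoded" : "refused"), "\n";
```

If you already know the byte order and the data has no BOM, naming it explicitly (UTF-16LE or UTF-16BE) sidesteps the BOM check altogether; in that case a leading BOM, if present, is left in the decoded string for you to deal with.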

(updated last paragraph to reflect ikegami's corrections -- thanks, ike)


In reply to Re^2: CSV nightmare by graff
in thread CSV nightmare by lorenzov
