I used to think that UCS-2LE was a synonym for UTF-16LE (likewise with BE instead of LE). But I then found this bit in a FAQ at unicode.org:
Q: What is the difference between UCS-2 and UTF-16?
A: UCS-2 is what a Unicode implementation was up to Unicode 1.1, before surrogate code points and UTF-16 were added as concepts to Version 2.0 of the standard. This term should now be avoided.
When interpreting what people have meant by "UCS-2" in past usage, it is best thought of as not a data format, but as an indication that an implementation does not interpret any supplementary characters. In particular, for the purposes of data exchange, UCS-2 and UTF-16 are identical formats. Both are 16-bit, and have exactly the same code unit representation.
The effective difference between UCS-2 and UTF-16 lies at a different level, when one is interpreting a sequence of code units as code points or as characters. In that case, a UCS-2 implementation would not handle processing like character properties, codepoint boundaries, collation, etc. for supplementary characters.
The current version of Unicode is 5.something, so there's not much point in using the "UCS-2" terminology these days.
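To make the "supplementary characters" part concrete, here's a quick sketch (the choice of character is arbitrary): anything outside the BMP takes a surrogate pair (two 16-bit code units) in UTF-16, which is exactly the case a UCS-2-only implementation won't interpret.

    use strict;
    use warnings;
    use Encode qw(encode);

    # U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP, so in UTF-16
    # it becomes a surrogate pair: two 16-bit code units, four bytes.
    my $clef    = "\x{1D11E}";
    my $utf16be = encode('UTF-16BE', $clef);

    print length($clef), " character, ", length($utf16be), " bytes\n";
    print join(' ', map { sprintf '%02X', ord } split //, $utf16be), "\n";
    # D8 34 DD 1E  (high surrogate D834, low surrogate DD1E)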
And yes, Microsoft tends to favor the LE byte order, especially for text data (MS-Word and "plain-text" exports from MS-Office products). But cells in Excel spreadsheets are, for some reason, stored as BE.
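Here's a rough, untested sketch of reading one of those LE exports (the filename is made up; Excel's "Unicode Text" save is typically tab-separated UTF-16LE with a leading BOM):

    use strict;
    use warnings;

    binmode STDOUT, ':encoding(UTF-8)';

    open my $fh, '<:encoding(UTF-16LE)', 'export.txt'
        or die "Can't open export.txt: $!";

    while (my $line = <$fh>) {
        $line =~ s/\A\x{FEFF}//;      # the LE layer leaves the BOM in the data
        $line =~ s/\r?\n\z//;         # MS exports usually end lines with CRLF
        my @cells = split /\t/, $line, -1;
        print join('|', @cells), "\n";
    }
    close $fh;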
As far as perl encoding layers are concerned, UTF-16 (with no byte-order spec) tends to mean: for output, byte order is apparently BE by default; for input, byte order is determined by a stream-initial BOM (if the BOM isn't there, perl complains about it; if it is there, perl will remove it for you).
(updated last paragraph to reflect ikegami's corrections -- thanks, ike)
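To see that behaviour without going through a file, here's a quick sketch using Encode directly (the :encoding() layer is built on the same codec):

    use strict;
    use warnings;
    use Encode qw(encode decode);

    # Encoding with no byte-order spec prepends a BOM (BE by default,
    # as described above).
    my $bytes = encode('UTF-16', 'hi');
    print join(' ', map { sprintf '%02X', ord } split //, $bytes), "\n";
    # FE FF 00 68 00 69

    # Decoding looks for a stream-initial BOM, uses it to pick the byte
    # order, and strips it from the result; with no BOM it croaks.
    my $chars = decode('UTF-16', $bytes);
    print length($chars), "\n";   # 2 -- the BOM is gone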