Re: UTF8 to UTF16, but...

Since your third-party folks are not explicitly requesting either UTF-16BE or UTF-16LE, you might need to include a byte-order mark (BOM, "\x{feff}", aka "zero-width no-break space") as the first character of a message:

perl -MEncode -le '$str="\x{0442}\x{0435}\x{0441}\x{0442}";
  $sms=unpack("H*",encode("UTF-16",$str)); print $sms'

#prints: feff0442043504410442

(Note how the Encode module puts the BOM in for you automatically when you tell it to create "UTF-16" data, as opposed to "UTF-16[LB]E", which will not get an automatic BOM included.)

You just need to find out what is really being requested -- if the BOM screws things up, try plain "UTF-16BE" without it; and if almut's guess about it being big-endian turns out to be wrong, then it must be little-endian ("UTF-16LE").

It is remarkably easy to get confused about this -- many folks are not accustomed to being as specific as they should be when describing their Unicode needs, and some tools for viewing wide-character byte sequences may give you a distorted view.

Re^2: UTF8 to UTF16, but... by almut (Canon) on Dec 12, 2008 at 02:32 UTC
> (...) and almut's guess about it being big-endian turns out to be wrong, then it must be little-endian ("UTF-16LE").

Not exactly a 'guess' :) Given the OP said that "0442043504410442" corresponds to the sample word "тест", it cannot really be little-endian, because that would be "4204350441044204".

Maybe it's worth noting that "UTF-16" with `encode()` assumes "BE" (quote from Encode::Unicode):

"When BE or LE is omitted during encode(), it returns a BE-encoded string with BOM prepended. So when you want to encode a whole text file, make sure you encode() the whole text at once, not line by line or each line, not file, will have a BOM prepended."

Of course, from the sample word alone we cannot tell whether a BOM is required.
Re^3: UTF8 to UTF16, but... by graff (Chancellor) on Dec 12, 2008 at 04:20 UTC
> Given the OP said that "0442043504410442" corresponds...

Well, since the OP didn't say exactly what sort of method was used to format that string of 16 hex digits for the 8 bytes, I'd have to remain in doubt about what the underlying byte order really is. There might have been some filtering behavior between reading and displaying that put the bytes into "logical" order for viewing by humans, being "smart" enough to know when byte swapping was necessary.

It's not a complicated matter, and is easy to test -- I'm just saying you can't be sure until you test it.