Re^2: UTF8 to UTF16, but...

(...) and almut's guess about it being big-endian turns out to be wrong, then it must be little-endian ("UTF-16LE").

Not exactly a 'guess' :) Given the OP said that "0442043504410442" corresponds to the sample word "тест", it cannot really be little-endian, because that would be "4204350441044204".
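For anyone who wants to check the two byte orders, here is a minimal sketch using Encode. The variable names are only for illustration, and the word is written with code point escapes so the result does not depend on how the script file itself is encoded.

```perl
use strict;
use warnings;
use Encode qw(encode);

# "тест" as code points U+0442 U+0435 U+0441 U+0442
my $word = "\x{0442}\x{0435}\x{0441}\x{0442}";

# Encode with an explicit byte order (no BOM is added), then hex-dump the bytes
my $be_hex = unpack "H*", encode("UTF-16BE", $word);
my $le_hex = unpack "H*", encode("UTF-16LE", $word);

print "UTF-16BE: $be_hex\n";   # 0442043504410442
print "UTF-16LE: $le_hex\n";   # 4204350441044204
```

Only the big-endian dump matches the string the OP posted, assuming that string shows the bytes in file order.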

It may be worth noting that encode() with plain "UTF-16" (no "BE" or "LE" suffix) produces big-endian output (quote from Encode::Unicode):

"When BE or LE is omitted during encode(), it returns a BE-encoded string with BOM prepended. So when you want to encode a whole text file, make sure you encode() the whole text at once, not line by line or each line, not file, will have a BOM prepended."
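A small sketch of what that means in practice. The two-line sample text is made up for illustration, and this is not a claim about what the OP's data needs.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $word  = "\x{0442}\x{0435}\x{0441}\x{0442}";   # "тест"
my @lines = ("$word\n", "$word\n");               # hypothetical two-line text

# Plain "UTF-16": big-endian with a BOM (FE FF) prepended
my $with_bom = unpack "H*", encode("UTF-16", $word);
print "$with_bom\n";   # feff0442043504410442

# Encoding line by line puts a BOM in front of every line ...
my $per_line = unpack "H*", join "", map { encode("UTF-16", $_) } @lines;
# ... while encoding the whole text at once yields a single leading BOM
my $at_once  = unpack "H*", encode("UTF-16", join "", @lines);

print "per line: $per_line\n";
print "at once:  $at_once\n";
```

With an explicit "UTF-16BE" or "UTF-16LE" no BOM is written at all, so the line-by-line problem does not arise there.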

Of course, from the sample word alone we cannot tell whether a BOM is required.


Replies are listed 'Best First'.

Re^3: UTF8 to UTF16, but...
by graff (Chancellor) on Dec 12, 2008 at 04:20 UTC

Given the OP said that "0442043504410442" corresponds...

Well, since the OP didn't say exactly what sort of method was used to format that string of 16 hex digits for the 8 bytes, I'd have to remain in doubt about what the underlying byte order really is. There might have been some filtering behavior between reading and displaying that put the bytes into "logical" order for viewing by humans, being "smart" enough to know when byte swapping was necessary.

It's not a complicated matter, and is easy to test -- I'm just saying you can't be sure until you test it.
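One way to run such a test is to read the raw bytes and dump them one by one, with no encoding layer in between. This is only a sketch: the temp file stands in for the OP's actual data, and writing it as UTF-16LE is purely an assumption so that the script has something to read.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Stand-in for the real file: "тест" written as UTF-16LE (assumed for the demo)
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh;
print {$fh} encode("UTF-16LE", "\x{0442}\x{0435}\x{0441}\x{0442}");
close $fh or die "close: $!";

# Read the first 8 bytes in raw mode, so no layer can reorder or decode them
open my $in, '<:raw', $path or die "open $path: $!";
read $in, my $buf, 8;
close $in;

# One hex pair per byte, in file order
my $dump = join " ", unpack "(H2)*", $buf;
print "$dump\n";   # 42 04 35 04 41 04 42 04
```

If the real data dumps as "04 42 04 35 ..." it is big-endian; a leading "fe ff" or "ff fe" would be a BOM.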
