I'm not familiar with the particular situation you are facing here, but the notion of "null terminated UTF-16 strings" gives me pause...
You are aware, I hope, of this important feature of Unicode characters in the range U+0000-U+00FF (i.e. "Basic Latin" a.k.a. "the ASCII range", and the "C1 Controls and Latin-1 Supplement" a.k.a. \x80-\xFF): when encoded as UTF-16, strings of these characters will have null bytes interspersed throughout -- because each character in that range is encoded as a single 16-bit unit whose high byte has all bits set to zero. (updated the wording here for clarity)
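To make that concrete, here is a minimal sketch (assuming the core Encode module is available) that encodes a plain ASCII string as UTF-16LE and dumps the resulting bytes in hex:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode a plain ASCII string as UTF-16LE (no byte-order mark is added).
my $bytes = encode('UTF-16LE', 'abc');

# Render each byte as two hex digits.
my $hex = join ' ', map { sprintf '%02x', ord } split //, $bytes;

print "$hex\n";    # 61 00 62 00 63 00
```

Every other byte is a null, so anything that stops at the first single null byte would see only "a".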
In order for a UTF-16 string to be "NULL terminated", I suppose you'd have to be referring to a 16-bit NULL character (two null bytes in a row, starting on a 16-bit boundary). Note also that your standard newline characters are 16-bit as well: on win32, CR LF is 0x000D 0x000A.
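A rough sketch of what looking for a 16-bit terminator might involve; utf16_nul_offset is a hypothetical helper name, and the buffer is made-up sample data:

```perl
use strict;
use warnings;

# Hypothetical helper: return the byte offset of the first 16-bit NUL
# (two zero bytes starting on an even offset), or -1 if there is none.
sub utf16_nul_offset {
    my ($buf) = @_;
    for (my $i = 0; $i + 1 < length($buf); $i += 2) {
        return $i if substr($buf, $i, 2) eq "\0\0";
    }
    return -1;
}

# "ab" in UTF-16LE, then a 16-bit NUL, then leftover bytes.
my $buf = "a\0b\0\0\0junk";

my $naive = index($buf, "\0");        # first single null byte
my $real  = utf16_nul_offset($buf);   # first aligned 16-bit NUL

print "$naive $real\n";    # 1 4
```

The naive search stops inside the first character, while stepping two bytes at a time finds the actual terminator.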
I think the best way to proceed may be to treat UTF-16 stuff as non-character, raw-binary, 16-bit "words". (I wonder how many programmers still use this terminology: 8 bits = 1 byte, 16 bits = 1 word.)
Once you bring the data into perl as raw binary, the perl script must then "unpack" or "decode" it from UTF-16LE (little-endian, since you're on win32) into perl's internal utf8. Check "perldoc -f pack" and "perldoc -f unpack", and the Encode module.
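Something along these lines, as a sketch only: $raw stands in for whatever bytes actually arrive from the win32 side, and the two approaches shown are just the ones named above (Encode's decode, and unpack with the 'v' template for 16-bit little-endian words):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for the raw bytes (an assumption): "Hi" encoded as UTF-16LE.
my $raw = "H\0i\0";

# Option 1: let Encode turn the bytes into a perl character string.
my $text = decode('UTF-16LE', $raw);

# Option 2: treat the data as 16-bit little-endian words and map each
# word to a character. This ignores surrogate pairs, so it is only
# adequate for characters in the Basic Multilingual Plane.
my @words = unpack 'v*', $raw;
my $text2 = join '', map { chr } @words;

print "$text $text2\n";    # Hi Hi
```

Encode handles surrogate pairs for you, so it is probably the safer default; the unpack route is mostly useful when you want to inspect the 16-bit words themselves.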
Maybe there's another way, but I'm not familiar with XS stuff in general...
(Update: Once perl has the string converted to utf8, characters in the ASCII range really truly are ASCII (single-byte), so a "null terminated string" becomes a simple concept again.)
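For instance (a sketch with made-up sample bytes, assuming the data has already been read in as raw UTF-16LE), after decoding the terminator is just one "\0" character and plain index/substr will do:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "ab" in UTF-16LE, a 16-bit NUL terminator, then one leftover character.
my $raw = "a\0b\0\0\0c\0";

# After decoding, the 16-bit NUL is a single "\0" character.
my $chars = decode('UTF-16LE', $raw);

# Cut the string at the first NUL, if there is one.
my $end = index($chars, "\0");
my $str = $end >= 0 ? substr($chars, 0, $end) : $chars;

# Encoded as UTF-8, the ASCII-range characters are one byte each.
my $utf8 = encode('UTF-8', $str);

print "$str ", length($utf8), "\n";    # ab 2
```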