in reply to import UTF-16 strings in XS

I'm not familiar with the particular situation you are facing here, but the notion of "null terminated UTF-16 strings" gives me pause...

You are aware, I hope, of this important feature of Unicode characters in the range U+0000-U+00FF (i.e. "Basic Latin" a.k.a. "the ASCII range", U+0000-U+007F, and the "C1 Controls and Latin-1 Supplement", U+0080-U+00FF): when encoded as UTF-16, strings of these characters will have null bytes interspersed throughout -- because the high byte of each 16-bit code unit in that range has all bits set to zero. (updated the wording here for clarity)
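
To make the point concrete, here is a minimal C sketch (C because that is what the XS side would be written in). The buffer contents and the helper name byte_length_seen_by_strlen are made up for illustration; the only library call is the standard strlen.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "Hi" encoded as UTF-16LE, followed by a 16-bit terminator.
 * Every character in the U+0000-U+00FF range has a zero high byte. */
static const unsigned char hi_utf16le[] = {
    0x48, 0x00,   /* 'H' = U+0048 */
    0x69, 0x00,   /* 'i' = U+0069 */
    0x00, 0x00    /* 16-bit terminator */
};

/* Hypothetical helper: the length a byte-oriented C routine would report.
 * strlen stops at the first zero byte, which here is the high byte of 'H'. */
static size_t byte_length_seen_by_strlen(const unsigned char *buf)
{
    assert(buf != NULL);
    return strlen((const char *)buf);
}
```

Calling byte_length_seen_by_strlen(hi_utf16le) reports a single byte even though the buffer holds two characters, which is why byte-oriented string functions are unsuitable for UTF-16 data.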

In order for a UTF-16 string to be "NULL terminated", I suppose you'd have to be referring to a 16-bit NULL character (two null bytes in a row, aligned on a 16-bit boundary; the bytes 41 00 00 01 contain two adjacent nulls but no terminator). Note also that your standard newline characters are 16-bit also: 0x000D 0x000A (CR, LF).

I think the best way to proceed may be to treat UTF-16 stuff as non-character, raw-binary, 16-bit "words". (I wonder how many programmers still use this terminology: 8 bits = 1 byte, 16 bits = 1 word.)
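
As a rough illustration of the "16-bit words" view, the sketch below walks a buffer two bytes at a time and stops at the first 0x0000 unit. The function names read_u16le and utf16le_units are hypothetical, and the code assumes the data is little-endian and really does end with a 16-bit terminator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assemble one little-endian 16-bit word from two bytes.
 * Reading byte by byte avoids alignment and host-endianness problems. */
static uint16_t read_u16le(const unsigned char *p)
{
    return (uint16_t)(p[0] | ((unsigned)p[1] << 8));
}

/* Hypothetical helper: number of 16-bit words before the 0x0000 terminator.
 * Assumes the buffer is terminated; there is no bounds check here. */
static size_t utf16le_units(const unsigned char *buf)
{
    size_t n = 0;
    assert(buf != NULL);
    while (read_u16le(buf + 2 * n) != 0)
        n++;
    return n;
}
```

The byte length to hand over as raw binary would then be 2 * utf16le_units(buf). Note that this counts 16-bit words, not characters: anything outside the Basic Multilingual Plane occupies two words.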

Once you bring the data into perl as raw binary, the perl script must then "unpack" or "decode" it from UTF-16LE (little-endian, since you're on win32) into perl's internal utf8. Check "perldoc -f pack" and "perldoc -f unpack", and the Encode module.
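
For comparison only: the decoding step itself is small enough to write out by hand. The following C sketch is not what I am recommending above (unpack or Encode on the perl side is simpler); it just shows what "decode from UTF-16LE to UTF-8" amounts to. The names put_utf8 and utf16le_to_utf8 are hypothetical, the input is assumed to end with a 16-bit terminator, and the caller is assumed to supply an output buffer of at least 3 bytes per input word plus 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write one code point as UTF-8; returns the number of bytes written. */
static size_t put_utf8(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Convert terminated UTF-16LE to NUL-terminated UTF-8.
 * Returns the UTF-8 length in bytes, not counting the terminator.
 * Unpaired surrogates are replaced with U+FFFD. */
static size_t utf16le_to_utf8(const unsigned char *in, unsigned char *out)
{
    size_t i = 0, o = 0;
    assert(in != NULL && out != NULL);
    for (;;) {
        uint32_t u = in[i] | ((uint32_t)in[i + 1] << 8);
        i += 2;
        if (u == 0)
            break;                          /* 16-bit terminator */
        if (u >= 0xD800 && u <= 0xDBFF) {   /* high surrogate */
            uint32_t lo = in[i] | ((uint32_t)in[i + 1] << 8);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                u = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
                i += 2;
            } else {
                u = 0xFFFD;                 /* no low surrogate follows */
            }
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            u = 0xFFFD;                     /* stray low surrogate */
        }
        o += put_utf8(u, out + o);
    }
    out[o] = 0;
    return o;
}
```

If something like this were used from XS, the resulting scalar would still have to be flagged as UTF-8 (the perlapi macro SvUTF8_on does that) before perl would treat the bytes as characters. I have not tried this inside an XS module, so treat it as a sketch.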

Maybe there's another way, but I'm not familiar with XS stuff in general...

(Update: Once perl has the string converted to utf8, characters in the ASCII range really, truly are ASCII (single-byte), so a "null terminated string" becomes a simple concept again.)

Re^2: import UTF-16 strings in XS
by Anonymous Monk on Sep 13, 2006 at 19:32 UTC
    Sorry, when I said NULL terminated, I meant terminated with a 16-bit, word-sized NULL. I thought there might be some well-known way to do this, but I guess not. There appears to be a way to call the unpack guts from XS, but no easy way to call any Encode bits, so I guess I'll be using unpack.