Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

In some XS code, I need a way to take NULL terminated UTF-16 strings that I get from the Win32 API and feed them to perl as Unicode SVs. What's the best way to do this?

Replies are listed 'Best First'.
Re: import UTF-16 strings in XS
by graff (Chancellor) on Sep 13, 2006 at 03:39 UTC
    I'm not familiar with the particular situation you are facing here, but the notion of "null terminated UTF-16 strings" gives me pause...

    You are aware, I hope, of this important feature of Unicode characters in the range U+0000-U+00FF (i.e. "Basic Latin" a.k.a. "the ASCII range", and the "C1 Controls and Latin-1 Supplement" a.k.a. \x80-\xFF): when encoded as UTF-16, strings of these characters will have null bytes interspersed throughout -- because the high byte of each 16-bit code unit in that range has all bits set to zero. (updated the wording here for clarity)

    In order for a UTF-16 string to be "NULL terminated", I suppose you'd have to be referring to a 16-bit NULL character (two null bytes in a row). Note also that your standard Win32 line-ending characters are 16-bit as well: 0x000D 0x000A (CR, LF).

    I think the best way to proceed may be to treat UTF-16 stuff as non-character, raw-binary, 16-bit "words". (I wonder how many programmers still use this terminology: 8 bits = 1 byte, 16 bits = 1 word.)
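
    To make the byte-versus-word distinction concrete, here is a minimal C sketch (C being what XS code boils down to). The function names utf16_units and naive_byte_len are made up for illustration, and the comments assume a little-endian machine such as Win32 on x86:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: count the 16-bit code units that precede the
 * 16-bit NUL terminator (0x0000). */
size_t utf16_units(const uint16_t *s)
{
    size_t n = 0;
    assert(s != NULL);
    while (s[n] != 0)
        n++;
    return n;
}

/* Hypothetical helper showing the pitfall: a byte-oriented strlen stops
 * at the first zero byte. For characters below U+0100 on a little-endian
 * machine, that zero is the high byte of the very first code unit. */
size_t naive_byte_len(const uint16_t *s)
{
    assert(s != NULL);
    return strlen((const char *)s);
}
```

    For the code units 0x0048 0x0069 0x0021 0x0000 ("Hi!"), utf16_units returns 3, so the payload is 6 bytes, while naive_byte_len returns 1 on a little-endian machine because the second byte is already zero.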

    Once you bring the data into perl as raw binary, the perl script must then "unpack" or "decode" it from UTF-16LE (little-endian, since you're on win32) into perl's internal utf8. Check "perldoc -f pack" and "perldoc -f unpack", and the Encode module.

    Maybe there's another way, but I'm not familiar with XS stuff in general...

    (Update: Once perl has the string converted to utf8, characters in the ASCII range really truly are ASCII (single-byte), so a "null terminated string" becomes a simple concept again.)

      Sorry, when I said NULL terminated, I meant terminated with a 16-bit, word-sized NULL. I thought there might be some well-known way to do this, but I guess not. There appears to be a way to call the unpack guts from XS, but no easy way to call any Encode bits, so I guess I'll be using unpack.
Re: import UTF-16 strings in XS (code)
by tye (Sage) on Sep 13, 2006 at 06:42 UTC
    my $utf8 = pack "U*", unpack "S*", $utf16;   # native 16-bit units -> characters
    $utf8 =~ s/\0.*//s;                          # drop the NUL terminator and anything after it

    provided that you return the UTF-16 string properly to Perl.

    Details about memory allocation complicate exactly how to do this (especially since Perl refuses to deal with memory that it didn't allocate itself). I can't give much more detail without guessing, so I'll leave it at that for now.

    Oh, and Win32API::Registry knows how to return such strings to Perl "properly" for some APIs, in case some sample code would be useful. Note that I wrapped up this stuff in some macros that I reuse in several places.

    - tye        

Re: import UTF-16 strings in XS
by creamygoodness (Curate) on Sep 13, 2006 at 03:49 UTC
    I'm not aware of any perlapi XS functions that deal with UTF-16. Doing something in perlspace using the Encode module or a UTF-16-aware filehandle is probably your best bet. Shy of that, I think you'd have to write your own UTF-16 to UTF-8 converter in C. That's not insane, but it's hard work testing it properly and preparing to deal with all the possible permutations of malformed data.
    --
    Marvin Humphrey
    Rectangular Research ― http://www.rectangular.com
      I was going to say that a C implementation of the UTF-16 to UTF-8 conversion would be pretty simple and robust -- in fact, you can probably find a C snippet for this at http://www.unicode.org.

      But it's true that if you mistakenly feed random (non-UTF-16) data into this sort of conversion, the result might be worse than just "garbage out".

      There are a fair number of "gaps" in the 16-bit space, where Unicode doesn't really have anything defined, as well as some spots that are specifically defined as "not usable characters". And heaven forbid the input data should contain anything in the UTF-16 "Surrogate" range (0xD800-0xDFFF), which is reserved for building "wider" characters using two consecutive 16-bit values (these get rendered into 4-byte utf8 codes, whereas all other 16-bit values end up as 1, 2 or 3 bytes in utf8).
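
      For what it's worth, the core of such a converter can be sketched in C roughly as follows. This is an illustrative sketch, not the unicode.org code: the name utf16_to_utf8 and its return convention are assumptions, it expects native-endian 16-bit units ending in a 16-bit NUL, and it only rejects unpaired surrogates (it does not filter noncharacters or unassigned code points).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical converter: NUL-terminated UTF-16 (native-endian units) to
 * NUL-terminated UTF-8. Returns the number of UTF-8 bytes written, not
 * counting the terminator, or -1 on an unpaired surrogate or when the
 * output buffer of `cap` bytes is too small. */
long utf16_to_utf8(const uint16_t *in, unsigned char *out, size_t cap)
{
    size_t o = 0;
    assert(in != NULL && out != NULL);
    if (cap == 0)
        return -1;                      /* no room even for the terminator */

    for (size_t i = 0; in[i] != 0; i++) {
        uint32_t cp = in[i];

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: the next unit must be a low surrogate.
             * Reading in[i + 1] is safe; at worst it is the terminator. */
            uint32_t lo = in[i + 1];
            if (lo < 0xDC00 || lo > 0xDFFF)
                return -1;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
            i++;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return -1;                  /* low surrogate with no high one */
        }

        size_t need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (o + need + 1 > cap)         /* keep one byte for the NUL */
            return -1;

        if (need == 1) {
            out[o++] = (unsigned char)cp;
        } else if (need == 2) {
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (unsigned char)(0xF0 | (cp >> 18));
            out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    out[o] = '\0';
    return (long)o;
}
```

      With this sketch, the single unit 0x20AC comes out as the three bytes E2 82 AC, the surrogate pair 0xD83D 0xDE00 (U+1F600) comes out as the four bytes F0 9F 98 80, and a high surrogate followed by anything other than a low surrogate makes the function return -1.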

        Win32 comes with APIs for converting from UTF-16 (or perhaps something similar, in any case likely referred to as "UNICODE") to UTF-8 (likely called "multi-byte character strings"). Unfortunately, this is the wrong computer with too tiny a browser to easily look up the name.

        I prefer to do such conversions in Perl anyway, as it reduces the complexity of the XS code (almost always a good idea) and allows one to avoid converting twice if you end up just passing the output from one API into another.

        If one really wants to do this conversion in C, then I'd strongly encourage providing an XS routine that does just this conversion and then provide a Perl sub to conveniently wrap the 2 (or more) XS calls for the "common case".

        - tye