I'm not familiar with the particular situation you are facing here, but the notion of "null terminated UTF-16 strings" gives me pause...

You are aware, I hope, of this important feature of Unicode characters in the range U+0000-U+00FF (i.e. "Basic Latin" a.k.a. "the ASCII range", U+0000-U+007F, plus the "C1 Controls and Latin-1 Supplement" a.k.a. \x80-\xFF): when encoded as UTF-16, strings of these characters will have null bytes interspersed throughout, because each character in that range is a single 16-bit code unit whose high byte has all bits set to zero. (updated the wording here for clarity)
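To see those interspersed null bytes for yourself, here is a minimal sketch using the core Encode module. The sample string "Perl" is just an illustration, not anything from your data:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode a short ASCII-range string as UTF-16LE (no BOM) and dump its bytes.
my $bytes = encode('UTF-16LE', "Perl");
my $hex   = join ' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes;

print "$hex\n";    # 50 00 65 00 72 00 6c 00
```

Every second byte is zero, so any C-style routine that stops at the first null byte would see only "P".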

In order for a UTF-16 string to be "NULL terminated", I suppose you'd have to be referring to a 16-bit NULL character (two null bytes in a row, aligned on a 16-bit boundary). Note also that your standard newline characters are 16-bit as well: on win32, CR LF is 0x000D 0x000A.

I think the best way to proceed may be to treat UTF-16 stuff as non-character, raw-binary, 16-bit "words". (I wonder how many programmers still use this terminology: 8 bits = 1 byte, 16 bits = 1 word.)
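As a rough sketch of the "16-bit words" idea, unpack with the 'v' template (unsigned 16-bit little-endian) turns the raw bytes into a list of words, and the terminator is simply the first word equal to zero. The buffer contents below are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical raw buffer: "Hi" in UTF-16LE, a 16-bit NULL, then leftover junk.
my $buf = "H\0i\0\0\0\xAA\xBB";

# 'v*' = as many unsigned 16-bit little-endian words as the buffer holds.
my @words = unpack 'v*', $buf;

# Collect words up to (not including) the first 16-bit NULL.
my @string_words;
for my $w (@words) {
    last if $w == 0;
    push @string_words, $w;
}

printf "%d words before terminator: %s\n",
    scalar @string_words,
    join ' ', map { sprintf '0x%04x', $_ } @string_words;
```

Working word by word matters: a plain byte search for "\0\0" can match across a word boundary (for example U+0041 followed by U+0100 is 41 00 00 01 in little-endian), which would cut the string in the wrong place.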

Once you bring the data into perl as raw binary, the perl script must then "unpack" or "decode" it from UTF-16LE (little-endian, since you're on win32) into perl's internal utf8. Check "perldoc -f pack" and "perldoc -f unpack", and the Encode module ("perldoc Encode"), whose decode() function accepts 'UTF-16LE' as an encoding name.
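A minimal sketch of the decode step, assuming the XS side hands back the raw bytes without a BOM and with the terminator already removed. The variable names and sample data are hypothetical:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for the raw bytes coming from the XS code:
# "caf" followed by U+00E9, encoded as UTF-16LE.
my $raw = "c\0a\0f\0\xE9\0";

# Convert the UTF-16LE octets into a perl character string.
my $text = decode('UTF-16LE', $raw);

printf "bytes in: %d, characters out: %d\n", length($raw), length($text);
printf "last character: U+%04X\n", ord(substr($text, -1));
```

After decoding, length() counts characters rather than bytes, so the 8 input bytes become a 4-character string.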

Maybe there's another way, but I'm not familiar with XS stuff in general...

(Update: Once perl has the string converted to utf8, characters in the ASCII range really truly are ASCII (single-byte), so a "null terminated string" becomes a simple concept again.)
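For instance, if the buffer holds several null-terminated UTF-16LE strings back to back (an assumption for the sake of the example, your data may be laid out differently), decoding first lets an ordinary split on "\0" pull them apart:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical buffer: "one" and "two", each followed by a 16-bit NULL.
my $raw = "o\0n\0e\0\0\0t\0w\0o\0\0\0";

# After decoding, each 16-bit NULL is a single "\0" character.
my $text    = decode('UTF-16LE', $raw);
my @strings = split /\0/, $text;

print join(', ', @strings), "\n";    # one, two
```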


In reply to Re: import UTF-16 strings in XS by graff
in thread import UTF-16 strings in XS by Anonymous Monk
