I was going to say that a C implementation of the UTF-16 to UTF-8 conversion would be pretty simple and robust -- in fact, you can probably find a C snippet for this at http://www.unicode.org.
But it's true that if you mistakenly feed random (non-UTF-16) data into this sort of conversion, the result might be worse than just "garbage out".
There are a fair number of "gaps" in the 16-bit space where Unicode doesn't really have anything defined, as well as some spots that are specifically defined as "not usable characters". And heaven forbid the input data should contain anything in the UTF-16 "surrogate" range (0xD800-0xDFFF), which is reserved for building "wider" characters out of two consecutive 16-bit values (those get rendered as 4-byte UTF-8 sequences, whereas all other code points end up as 1, 2 or 3 bytes in UTF-8).
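For concreteness, here's a minimal, untested C sketch of such a converter -- not the unicode.org snippet, just an illustration of the byte-length rules and the surrogate handling described above. The function name and the caller-supplied-buffer convention are my own:

    /* Sketch: convert n UTF-16 code units to UTF-8.
     * 'out' must have room for at least 3 bytes per input code unit.
     * Returns the number of UTF-8 bytes written, or -1 on a lone surrogate
     * (i.e. input that isn't well-formed UTF-16). */
    #include <stdint.h>
    #include <stddef.h>

    ptrdiff_t utf16_to_utf8(const uint16_t *in, size_t n, unsigned char *out)
    {
        unsigned char *o = out;
        for (size_t i = 0; i < n; i++) {
            uint32_t cp = in[i];

            if (cp >= 0xD800 && cp <= 0xDBFF) {        /* high surrogate */
                if (i + 1 >= n || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
                    return -1;                         /* unpaired: bad input */
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                i++;                                   /* consumed the pair */
            } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
                return -1;                             /* stray low surrogate */
            }

            if (cp < 0x80) {                           /* 1 byte  */
                *o++ = (unsigned char)cp;
            } else if (cp < 0x800) {                   /* 2 bytes */
                *o++ = 0xC0 | (cp >> 6);
                *o++ = 0x80 | (cp & 0x3F);
            } else if (cp < 0x10000) {                 /* 3 bytes */
                *o++ = 0xE0 | (cp >> 12);
                *o++ = 0x80 | ((cp >> 6) & 0x3F);
                *o++ = 0x80 | (cp & 0x3F);
            } else {                                   /* 4 bytes (from a pair) */
                *o++ = 0xF0 | (cp >> 18);
                *o++ = 0x80 | ((cp >> 12) & 0x3F);
                *o++ = 0x80 | ((cp >> 6) & 0x3F);
                *o++ = 0x80 | (cp & 0x3F);
            }
        }
        return o - out;
    }

Note that this deliberately fails on unpaired surrogates rather than letting them through -- that's exactly the "worse than garbage out" case mentioned above.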
Win32 comes with APIs for converting from UTF-16 (or perhaps something similar, in any case likely referred to as "UNICODE") to UTF-8 (likely called "multi-byte-character strings"). Unfortunately, I'm on the wrong computer with too tiny a browser to easily look up the name.
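For what it's worth, the API I believe is being referred to is WideCharToMultiByte() with the CP_UTF8 code page (MultiByteToWideChar() goes the other way). A rough, untested sketch of the usual two-call pattern -- the helper name here is made up:

    /* Sketch: convert a NUL-terminated UTF-16 (wide) string to a
     * malloc'd UTF-8 string using the Win32 API.  Caller must free(). */
    #include <windows.h>
    #include <stdlib.h>

    char *wide_to_utf8(const wchar_t *wide)
    {
        /* First call: ask how many bytes the UTF-8 result needs (incl. NUL). */
        int bytes = WideCharToMultiByte(CP_UTF8, 0, wide, -1, NULL, 0, NULL, NULL);
        if (bytes == 0)
            return NULL;                   /* conversion failed */

        char *utf8 = malloc((size_t)bytes);
        if (utf8 == NULL)
            return NULL;

        /* Second call: do the actual conversion into the buffer. */
        if (WideCharToMultiByte(CP_UTF8, 0, wide, -1, utf8, bytes, NULL, NULL) == 0) {
            free(utf8);
            return NULL;
        }
        return utf8;
    }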
I prefer to do such conversions in Perl anyway, as it reduces the complexity of the XS code (almost always a good idea) and lets you avoid converting twice if you end up just passing the output from one API into another.
If one really wants to do this conversion in C, then I'd strongly encourage providing an XS routine that does just this conversion, and then providing a Perl sub that conveniently wraps the two (or more) XS calls for the "common case".