UTF8 to UTF16, but...

natol44 has asked for the wisdom of the Perl Monks concerning the following question:

Hello!

We use a third-party SMS gateway to send SMS messages from our server. To send non-ASCII characters (Russian, for example), the gateway needs what they call "unicode utf-16" data.

We display a form for our users on our website, in a UTF-8 encoded HTML page. The user submits a $message (in Russian or German, for example), our Perl script records the data, and finally sends the SMS.

The SMS gateway expects data such as "0442043504410442" for the Russian word "тест". We tried a few conversions in the Perl script but never got the string 0442043504410442 for this word.

For example:

use Text::Iconv; # the iconv command perl library
my $converter = Text::Iconv->new("UTF8", "UTF16");
my $converted = $converter->convert($message);

$converted is not the expected string. I guess that the result is a decimal form of UTF-16 or something like that, but how do I get it in Perl?

I should add that the string will not be displayed, only sent to the gateway.

Thank you!

Re: UTF8 to UTF16, but... by almut (Canon) on Dec 11, 2008 at 19:45 UTC
Seems to be the big-endian UTF-16 representation as plain hex:

    #!/usr/bin/perl
    use Encode;

    # Perl char string (UTF-8) - i.e. your input
    # (this is the Russian word - literal UTF-8 chars wouldn't be displayed
    # correctly in PerlMonks <code> blocks - thus here as \x{....} )
    my $str = "\x{0442}\x{0435}\x{0441}\x{0442}";

    my $utf16 = encode("UTF-16be", $str);
    my $sms = unpack "H*", $utf16;

    print "$sms\n"; # "0442043504410442"
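The example above starts from a Perl character string. A message submitted through a UTF-8 form usually reaches the script as raw UTF-8 bytes, in which case it probably needs to be decoded first. A minimal sketch, assuming `$message` holds undecoded UTF-8 octets; the helper name `utf8_to_utf16be_hex` is made up for illustration and is not part of any module:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: raw UTF-8 octets in, big-endian UTF-16 hex out.
sub utf8_to_utf16be_hex {
    my ($octets) = @_;
    my $chars = decode("UTF-8", $octets);     # octets -> Perl characters
    my $utf16 = encode("UTF-16BE", $chars);   # characters -> UTF-16BE octets, no BOM
    return unpack("H*", $utf16);              # octets -> lowercase hex digits
}

# Assumption: this is the word from the question as UTF-8 bytes,
# the way a form submission would typically deliver it.
my $message = "\xd1\x82\xd0\xb5\xd1\x81\xd1\x82";
print utf8_to_utf16be_hex($message), "\n";    # 0442043504410442
```

If the script already decodes its input (for example through an input layer), the `decode` step should be dropped, since decoding twice would corrupt the text.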
Re: UTF8 to UTF16, but... by graff (Chancellor) on Dec 12, 2008 at 01:39 UTC
Since your third-party folks are not literally, specifically requesting either UTF-16BE or UTF-16LE, you might need to include a byte-order mark (BOM, "\x{feff}", aka "zero-width no-break space") as the first character of a message:

    perl -MEncode -le '$str="\x{0442}\x{0435}\x{0441}\x{0442}"; $sms=unpack("H*",encode("UTF-16",$str)); print $sms'
    # prints: feff0442043504410442

(Note how the Encode module puts the BOM in for you automatically when you tell it to create "UTF-16" data, as opposed to `"UTF-16[LB]E"`, which will not get an automatic BOM included.)

You just need to make sure about what is really being requested -- if the BOM screws things up, and almut's guess about it being big-endian turns out to be wrong, then it must be little-endian ("UTF-16LE").

It can be remarkably easy for this to be a matter of confusion -- many folks are not accustomed to being as specific as they should be when describing their Unicode needs, and some tools for viewing wide-character byte sequences may give you a distorted view.
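To see the three candidates side by side, the same word can be encoded under each name and dumped as hex. This is only a sketch for comparison; which form the gateway actually wants still has to be confirmed with the gateway itself.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $str = "\x{0442}\x{0435}\x{0441}\x{0442}";   # the sample word from the question

# Hex dump of the same string under each of the three encoding names.
my %hex = map { $_ => unpack("H*", encode($_, $str)) }
          qw(UTF-16 UTF-16BE UTF-16LE);

printf "%-9s %s\n", $_, $hex{$_} for qw(UTF-16 UTF-16BE UTF-16LE);
```

This should print `feff0442043504410442` for UTF-16, `0442043504410442` for UTF-16BE, and `4204350441044204` for UTF-16LE.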
Re^2: UTF8 to UTF16, but... by almut (Canon) on Dec 12, 2008 at 02:32 UTC
> (...) and almut's guess about it being big-endian turns out to be wrong, then it must be little-endian ("UTF-16LE").

Not exactly a 'guess' :) Given the OP said that "0442043504410442" corresponds to the sample word "тест", it cannot really be little-endian, because that would be "4204350441044204".

Maybe it's worth noting that "UTF-16" with `encode()` assumes "BE" (quote from Encode::Unicode):

"When BE or LE is omitted during encode(), it returns a BE-encoded string with BOM prepended. So when you want to encode a whole text file, make sure you encode() the whole text at once, not line by line or each line, not file, will have a BOM prepended."

Of course, from the sample word alone we cannot tell whether a BOM is required.
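The warning in that quote is easy to reproduce. A small sketch, splitting the sample word into two pieces purely for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode);

# The sample word cut into two halves (an arbitrary split for the demo).
my @pieces = ("\x{0442}\x{0435}", "\x{0441}\x{0442}");

# One encode() call per piece: each piece gets its own BOM (feff).
my $piecewise = join "", map { unpack "H*", encode("UTF-16", $_) } @pieces;

# One encode() call for the whole text: a single BOM at the start.
my $whole = unpack "H*", encode("UTF-16", join("", @pieces));

print "piecewise: $piecewise\n";   # feff04420435feff04410442
print "whole:     $whole\n";       # feff0442043504410442
```

So if a BOM turns out to be needed, the message should be encoded in one call; with "UTF-16BE" or "UTF-16LE" the issue does not arise because no BOM is added.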
Re^3: UTF8 to UTF16, but... by graff (Chancellor) on Dec 12, 2008 at 04:20 UTC
> Given the OP said that "0442043504410442" corresponds...

Well, since the OP didn't say exactly what sort of method was used to format that string of 16 hex digits for the 8 bytes, I'd have to remain in doubt about what the underlying byte order really is. There might have been some filtering behavior between reading and displaying that put the bytes into "logical" order for viewing by humans, being "smart" enough to know when byte swapping was necessary.

It's not a complicated matter, and is easy to test -- I'm just saying you can't be sure until you test it.
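One local check, short of sending a real message, is to turn the sample hex back into octets and decode it under both byte orders. This is a sketch that takes the hex string from the question at face value; it cannot rule out the kind of display-side byte swapping described above, so a test message through the gateway remains the real proof.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $hex    = "0442043504410442";                  # sample string from the question
my $octets = pack("H*", $hex);                    # hex digits -> raw octets
my $want   = "\x{0442}\x{0435}\x{0441}\x{0442}";  # the word it is supposed to represent

# Decode the same octets under both byte orders and compare with the word.
my %matches = map { $_ => (decode($_, $octets) eq $want ? 1 : 0) }
              qw(UTF-16BE UTF-16LE);

print "$_: ", ($matches{$_} ? "matches" : "does not match"), "\n"
    for qw(UTF-16BE UTF-16LE);
```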