Unicode Help

Avox has asked for the wisdom of the Perl Monks concerning the following question:

Re: Unicode Help by halley (Prior) on Mar 17, 2004 at 17:22 UTC
Unicode is a mapping between numbers and characters. You refer to 16-bit encodings for characters, which is not the same thing as the general concept of Unicode. It could be any number of different encoding schemes. This doesn't answer your question directly, but you might want to read my FMTYEWTK about Characters vs Bytes node to get a better understanding of how to think about encoding. -- `[ e d @ h a l l e y . c c ]`
Re: Re: Unicode Help by zby (Vicar) on Mar 17, 2004 at 17:33 UTC
In the linked node there is nothing about UTF-16, which the OP apparently needs to use. Here is the Wikipedia page about it: UTF-16.
Re: Re: Re: Unicode Help by halley (Prior) on Mar 17, 2004 at 17:37 UTC
I think my point is that it's still just a guess that the application needs UTF-16. Microsoft uses a non-Unicode "DBCS" character set which is not the same as UTF-16, but would look very similar for many simple samples. Assumptions are dangerous. -- `[ e d @ h a l l e y . c c ]`
Re: Re: Re: Re: Unicode Help by iburrell (Chaplain) on Mar 17, 2004 at 19:13 UTC
Re: Re: Unicode Help by Avox (Sexton) on Mar 17, 2004 at 17:29 UTC
Hey, thanks for the link. I don't think my issue is really with Unicode itself, but with encoding the string I want to send into a 16-bit format. Is there an easy way to do this?
Re: Re: Re: Unicode Help by halley (Prior) on Mar 17, 2004 at 17:33 UTC
You'll need to find out what encoding they really want, and then target that encoding. You have two choices: look at some examples and decide something arbitrarily, like "the first byte is ASCII, the second byte is zero," or actually find out what the application is expecting. The former can get you running; the latter will avoid sticky problems when a message must include non-ASCII characters like u-with-umlauts or capital-sigma or elvish-parma. If you find the actual encoding standard they expect, you'll probably find a Perl module that will help you encode to that scheme without much fuss. It would be a rare standard that forced you to encode things yourself, but the Perl built-in functions `pack` and `unpack` are a good start to your solution. -- `[ e d @ h a l l e y . c c ]`
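The two choices described above can be sketched side by side. This is only an illustration under assumptions: the payload `<msg>` and the helper names `naive_16bit_le`, `utf16le`, and `hex_dump` are made up here, and little-endian UTF-16 is a guess at what the application wants, not something the thread has confirmed. It uses the built-in `pack`/`unpack` and the core `Encode` module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Quick-and-dirty choice: turn each character into a 16-bit little-endian
# value, so ASCII text comes out as "character byte, zero byte".
# Only safe for code points up to 0xFFFF.
sub naive_16bit_le {
    my ($text) = @_;
    return pack('v*', map { ord($_) } split(//, $text));
}

# Named-encoding choice: let Encode do the work once the scheme is known.
# UTF-16LE is an assumption here, not a confirmed requirement.
sub utf16le {
    my ($text) = @_;
    return encode('UTF-16LE', $text);
}

# Render bytes as space-separated hex pairs, handy for comparing
# against a network sniffer dump.
sub hex_dump {
    my ($bytes) = @_;
    return join(' ', map { sprintf('%02X', $_) } unpack('C*', $bytes));
}

my $message = '<msg>';    # hypothetical payload
print hex_dump(naive_16bit_le($message)), "\n";   # 3C 00 6D 00 73 00 67 00 3E 00
print hex_dump(utf16le($message)), "\n";          # 3C 00 6D 00 73 00 67 00 3E 00
```

For plain ASCII the two agree byte for byte. They diverge for characters above U+FFFF, which real UTF-16 encodes as a surrogate pair (four bytes) and which the `pack` shortcut cannot represent.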
Re: Unicode Help by Avox (Sexton) on Mar 17, 2004 at 18:49 UTC
Well, with the help of a bit of network sniffing, I've been comparing the message I'm sending to the one a working client is sending. I've started using the UTF-16 encoding, and both messages are identical except for the first 4 bytes at the beginning and the last byte at the end of the "working message". The working message begins with: FF FE FF c3 3C. The non-working message begins with: FE FF 00 3C. The end of the non-working message just needs a 00 tacked on. Does this mean anything to anyone?
Re: Re: Unicode Help by iburrell (Chaplain) on Mar 17, 2004 at 19:09 UTC
0xFFFE and 0xFEFF are the byte order marks. They are used to determine the byte order of the UTF-16 encodings. You need to ask which byte order of UTF-16 the server is using. I am guessing that they are using UTF-16LE, since that uses 0xFFFE. Using the correct ordering may make the "extra" bytes go away.
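To see how the choice of byte order changes the leading bytes, here is a small sketch using the core `Encode` module. The sample text `<` and the helper names `hex_bytes` and `utf16_variants` are illustrative only, and whether the server really wants little-endian data with a leading byte order mark is still a guess.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Render bytes as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join(' ', map { sprintf('%02X', $_) } unpack('C*', $bytes));
}

# Encode the same text under each UTF-16 flavour and return
# [label, hex string] pairs in a fixed order.
sub utf16_variants {
    my ($text) = @_;
    return (
        [ 'UTF-16',         hex_bytes(encode('UTF-16',   $text)) ],
        [ 'UTF-16BE',       hex_bytes(encode('UTF-16BE', $text)) ],
        [ 'UTF-16LE',       hex_bytes(encode('UTF-16LE', $text)) ],
        # Explicit BOM: prepend U+FEFF before encoding as little-endian.
        [ 'UTF-16LE + BOM', hex_bytes(encode('UTF-16LE', "\x{FEFF}" . $text)) ],
    );
}

printf "%-15s %s\n", @$_ for utf16_variants('<');
# UTF-16          FE FF 00 3C
# UTF-16BE        00 3C
# UTF-16LE        3C 00
# UTF-16LE + BOM  FF FE 3C 00
```

Plain `UTF-16` in `Encode` writes a big-endian byte order mark, which matches the `FE FF 00 3C` seen at the start of the non-working message. Switching to little-endian also moves the zero byte to the end of each character, which would account for the trailing `00`.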
Re: Re: Unicode Help by zby (Vicar) on Mar 17, 2004 at 19:10 UTC
This does not explain all of your symptoms, but it can lead you to a better understanding. From the UTF-16 page on Wikipedia: The UTF-16 encoding scheme mandates that the byte order must be declared by prepending a Byte Order Mark before the first serialized character. This BOM is the encoded version of the Zero-Width No-Break Space character, Unicode number FEFF in hex, manifesting as the byte sequence FE FF for big-endian, or FF FE for little-endian. A BOM at the beginning of UTF-16 encoded data is considered to be a signature separate from the text itself; it is for the benefit of the decoder. The UTF-16LE and UTF-16BE encoding schemes are identical to the UTF-16 encoding schemes, but rather than using a BOM, the byte order is implicit in the name of the encoding (LE for little-endian, BE for big-endian). A BOM at the beginning of UTF-16LE or UTF-16BE encoded data is not considered to be a BOM; it is part of the text itself.
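The difference between "signature" and "part of the text" in the quoted passage can be checked from the decoding side. The following is a rough sketch with the core `Encode` module; the helper name `sniff_bom` and the sample bytes are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Guess the byte order from the first two bytes of a captured message.
# Returns an encoding name, or undef when no UTF-16 BOM is present.
sub sniff_bom {
    my ($bytes) = @_;
    my $head = substr($bytes, 0, 2);
    return 'UTF-16LE' if $head eq "\xFF\xFE";
    return 'UTF-16BE' if $head eq "\xFE\xFF";
    return undef;
}

# Hypothetical capture: little-endian BOM followed by '<'.
my $captured = "\xFF\xFE\x3C\x00";

# Decoding as plain UTF-16 treats the BOM as a signature and drops it.
my $as_utf16 = decode('UTF-16', $captured);

# Decoding as UTF-16LE keeps U+FEFF as the first character of the text.
my $as_utf16le = decode('UTF-16LE', $captured);

print sniff_bom($captured), "\n";                 # UTF-16LE
print length($as_utf16), "\n";                    # 1
print length($as_utf16le), "\n";                  # 2
printf "U+%04X\n", ord(substr($as_utf16le, 0, 1)); # U+FEFF
```

Running `sniff_bom` over the first bytes of both sniffed messages is a quick way to confirm which byte order each client is actually sending.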
Re: Unicode Help by Avox (Sexton) on Mar 17, 2004 at 19:15 UTC
Seriously, people, thanks a lot for all the help. I'll give the little-endian stuff a try and report back.
Re: Re: Unicode Help by Avox (Sexton) on Apr 01, 2004 at 15:57 UTC
Well, due to deadlines, I wasn't able to pursue the pure Perl version of this as I'd hoped. I ended up writing a little MFC command-line app to send the message. Since I compiled it with MS Unicode support, it all works. I just execute the command-line app via Perl now. When I get some time, I still hope to go back and figure this out. Thanks, everyone...