in reply to Cannot decode string with wide characters - I'm not decoding!
I need to convert any non-ASCII characters in a string into their escaped Unicode forms (like \u00E3)...
I'm not sure why I am getting this error, since I'm not trying to decode anything.
When you convert from UTF-8 (or UTF-16, or any other encoding) to Unicode, you are decoding. When you convert from Unicode to UTF-8, you are encoding. A Unicode code point is just an integer, like 8634, and writing that integer in hex rather than decimal does not change the fact that it is a Unicode integer. The '\u' says, "Hey, what follows is a Unicode integer in hex format." The decision about how many bytes to use to store that Unicode integer in a string is the decision about which encoding to use.

"I think the problem might be because the Escape routine is expecting UTF8 and the Japanese is in UTF16 or something. I'm not entirely sure..."
If you don't know what encoding a string has, you can't convert it to Unicode.
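To make the decode/encode distinction concrete, here is a minimal sketch. The thread itself is about Perl, but the mechanics are the same in Python, which is used here for illustration; the helper `escape_non_ascii` is a hypothetical name, not anything from the original poster's code:

```python
# "ã" is the single code point U+00E3.  Encoding it produces bytes;
# decoding those bytes (with the right encoding!) recovers the code point.
utf8_bytes = "ã".encode("utf-8")     # b'\xc3\xa3' -- two UTF-8 bytes
char = utf8_bytes.decode("utf-8")    # the one-character string "ã" again
assert ord(char) == 0x00E3           # the Unicode integer, written in hex

def escape_non_ascii(s):
    # Replace each code point >= 128 with its \uXXXX escape,
    # which is what the original poster wants to produce.
    return "".join(c if ord(c) < 0x80 else "\\u%04X" % ord(c) for c in s)

print(escape_non_ascii("maçã"))   # ma\u00E7\u00E3
```

Note that the escaping step only works once you have a properly decoded string of code points; if you feed it raw bytes in an unknown encoding, you are back to the error in the thread title.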
UTF-16 is very easy to parse. Whatever is reading the string just blindly reads two-byte chunks (16 bits) from the string, and whatever is in those two bytes is a Unicode integer (code points above U+FFFF are the exception: they take two such chunks, called a surrogate pair). UTF-8, however, is a tricky encoding. It uses from 1 to 4 bytes to store a Unicode integer. In order to let whatever is reading the string know how many bytes to read for each Unicode integer, UTF-8 uses marker bits at the start of each byte: the lead byte announces the length of the sequence, and every continuation byte begins with the bits 10. UTF-16 doesn't need any such markers because every Unicode integer (in the basic range) is stored in two bytes, so whatever is reading the string just reads two bytes at a time.
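The difference in byte layout is easy to see by encoding a few characters and inspecting the result (Python here, purely for illustration):

```python
s = "aé日"   # one ASCII char, one accented char, one CJK char

# UTF-8 is variable length: 1, 2, and 3 bytes respectively.
assert len("a".encode("utf-8")) == 1
assert len("é".encode("utf-8")) == 2
assert len("日".encode("utf-8")) == 3

# UTF-16 (little-endian, no BOM): two bytes for every one of these code points.
assert len(s.encode("utf-16-le")) == 2 * len(s)

# The UTF-8 length markers live in the high bits of each byte:
# a lead byte of the form 110xxxxx announces a 2-byte sequence,
# and the continuation byte starts with 10xxxxxx.
lead, cont = "é".encode("utf-8")
assert lead >> 5 == 0b110 and cont >> 6 == 0b10
```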
Now suppose a string is encoded in UTF-16, but the program reading the string is expecting UTF-8. The string reader will start reading bytes, using the marker bits in each lead byte to decide how many bytes belong to each Unicode integer. But because the string is actually encoded in UTF-16, the bytes won't line up with valid UTF-8 sequences: sooner or later the reader hits a byte pattern that no UTF-8 sequence allows, and decoding fails.
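This mismatch can be demonstrated directly (again a Python sketch; the specific Japanese string is just an example, not from the thread):

```python
# Bytes produced by UTF-16 generally do not form valid UTF-8 sequences.
utf16_bytes = "日本語".encode("utf-16-le")   # b'\xe5\x65\x2c\x67\x9e\x8a'

try:
    utf16_bytes.decode("utf-8")
except UnicodeDecodeError as e:
    # 0xE5 (1110 0101) promises a 3-byte UTF-8 sequence, but the next
    # byte 0x65 (0110 0101) is not a 10xxxxxx continuation byte.
    print("decode failed:", e.reason)
```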
Here is a concrete example:
0000 0001 0001
If you know the Unicode integer is stored in the first byte (8 bits), then you know that the Unicode integer is 0000 0001, which is 1 in decimal. However, if the Unicode integer is stored in all twelve bits, then the Unicode integer is 17 (the three 4-bit groups read as the hex number 0x011 = 1*16 + 1*1).
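The same twelve bits from the example, read at the two different widths (a Python one-liner per case, just to check the arithmetic):

```python
bits = "000000010001"   # the 12 bits from the example above

# Read only the first byte (8 bits): value 1.
assert int(bits[:8], 2) == 1

# Read all 12 bits as one integer: value 17 (= 1*16 + 1*1).
assert int(bits, 2) == 17
```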
In short, unless you tell a program what it should be looking for when reading a string (that is, the encoding), the program can't know how many bytes to read for each Unicode integer stored in the string. Remember, a computer can only store numbers, so Unicode integers are really codes for characters.
Replies are listed 'Best First'.

Re^2: Cannot decode string with wide characters - I'm not decoding!
by choroba (Cardinal) on May 04, 2011 at 12:11 UTC
by 7stud (Deacon) on May 05, 2011 at 00:47 UTC

Re^2: Cannot decode string with wide characters - I'm not decoding!
by ikegami (Patriarch) on May 05, 2011 at 06:51 UTC