in reply to Cannot decode string with wide characters - I'm not decoding!

    I need to convert any non-ASCII characters in a string into their escaped Unicode forms (like \u00E3)...

    I'm not sure why I am getting this error, since I'm not trying to decode anything.

When you convert from UTF-8 (or UTF-16 or any other 'encoding') to Unicode, you are decoding. When you convert from Unicode to UTF-8, you are encoding. Unicode is an integer like 8634, and writing that integer in hex format (rather than decimal) does not change the fact that it is a Unicode integer. The '\u' says, "Hey, what follows is a Unicode integer in hex format." The decision about how many bytes you want to use to store that Unicode integer in a string is the decision about which encoding to use.
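
To make that concrete in Perl (the error in the title is Perl's Encode error), here is a minimal sketch of both directions, plus the \u escaping the question asks for; the byte string and variable names are made up for illustration:

    use strict;
    use warnings;
    use Encode qw(decode encode);

    # Bytes as read from a file or socket; "\xC3\xA3" is the UTF-8
    # encoding of U+00E3 (a-tilde).
    my $bytes = "caf\xC3\xA3";

    # Decoding: bytes in a known encoding -> Perl character string.
    my $chars = decode('UTF-8', $bytes);

    # Escape each non-ASCII character as \uXXXX.
    (my $escaped = $chars) =~ s/([^\x00-\x7F])/sprintf('\u%04X', ord $1)/ge;
    print $escaped, "\n";    # caf\u00E3

    # Encoding: Perl character string -> bytes in a chosen encoding.
    my $back = encode('UTF-8', $chars);

Calling decode on $chars a second time is exactly what triggers "Cannot decode string with wide characters": the string is already decoded.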

    I think the problem might be because the Escape routine is expecting UTF8 and the Japanese is in UTF16 or something. I'm not entirely sure...

If you don't know what encoding a string has, you can't convert it to Unicode.

UTF-16 is very easy to parse. Whatever is reading the string just blindly reads two-byte chunks (16 bits) from the string, and whatever is in those two bytes is a Unicode integer. However, UTF-8 is a tricky encoding. It uses from 1 to 4 bytes to store a Unicode integer. To let whatever is reading the string know how many bytes to read for each Unicode integer, UTF-8 uses marker bits at the start of every byte: the first byte of a sequence announces how many bytes the sequence contains, and each following byte is marked as a continuation. UTF-16 doesn't need any such markers because every Unicode integer is stored in two bytes, so whatever is reading the string just reads two bytes at a time.
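
To see those marker bits, here is a small sketch (Perl again, with a Japanese character since the question involves Japanese text):

    use strict;
    use warnings;
    use Encode qw(encode);

    # U+3042 (hiragana 'a') takes three bytes in UTF-8.
    my $bytes = encode('UTF-8', "\x{3042}");
    printf "%08b\n", ord $_ for split //, $bytes;
    # 11100011  leading byte: '1110...' announces a 3-byte sequence
    # 10000001  continuation byte: starts with '10'
    # 10000010  continuation byte: starts with '10'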

Now suppose a string is encoded in UTF-16, but the program reading the string is expecting UTF-8. The reader will inspect each byte's marker bits to decide how many bytes make up the current Unicode integer. But because the string is encoded in UTF-16, those marker bits won't line up the way UTF-8 expects, so the reader either fails or produces garbage.
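
You can watch that failure happen. This sketch encodes two Japanese characters as UTF-16BE and then tries to read the bytes as UTF-8; Encode's FB_CROAK flag turns the mismatch into a fatal error instead of silently substituting U+FFFD:

    use strict;
    use warnings;
    use Encode qw(encode decode FB_CROAK);

    my $utf16 = encode('UTF-16BE', "\x{65E5}\x{672C}");   # bytes 65 E5 67 2C
    my $chars = eval { decode('UTF-8', $utf16, FB_CROAK) };
    print $@ ? "decode failed: $@" : "$chars\n";
    # decode failed: the byte 0xE5 promises a 3-byte sequence,
    # but the byte after it is not a continuation byte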

Here is a concrete example, using 12 bits:

    0000 0001 0001

If you know the Unicode integer is stored in the first byte (8 bits), then you know that the Unicode integer is 0000 0001, which is 1 in decimal. However, if the Unicode integer is stored in all 12 bits, then the Unicode integer is 0000 0001 0001, which is 17 (= 1*16 + 1*1).
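
The same arithmetic, checked with Perl's binary literals:

    printf "%d\n", 0b00000001;      # first 8 bits only ->  1
    printf "%d\n", 0b000000010001;  # all 12 bits       -> 17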

In short, unless you tell a program what it should be looking for when reading a string (that is, its encoding), the program can't know how many bytes to read for each Unicode integer stored in the string. Remember, a computer can only store numbers, so Unicode integers are really codes for characters.


Re^2: Cannot decode string with wide characters - I'm not decoding!
by choroba (Cardinal) on May 04, 2011 at 12:11 UTC
    UTF-16 is very easy to parse. Whatever is reading the string just blindly reads two-byte chunks (16 bits) from the string, and whatever is in those two bytes is a Unicode integer.
    What you describe here is UCS-2. See UTF-16.
      Doh. Thanks for correcting me.
Re^2: Cannot decode string with wide characters - I'm not decoding!
by ikegami (Patriarch) on May 05, 2011 at 06:51 UTC

    When you convert from UTF-8 (or UTF-16 or any other 'encoding') to Unicode, you are decoding. When you convert from Unicode to UTF-8, you are encoding. Unicode is an integer like 8634

    Same, but cleaned up a bit:

    When you convert from UTF-8 (or UTF-16 or any other 'encoding') to Unicode, you are decoding. When you convert from Unicode to UTF-8, you are encoding. A Unicode string consists of code points, integers like 8634.
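
    As a quick illustration (a Perl aside; 8634 is hex 0x21BA):

        printf "%d\n",     ord "\x{21BA}";  # 8634   -- the code point as an integer
        printf "U+%04X\n", 8634;            # U+21BA -- the same integer in hex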

    UTF-16 is very easy to parse. Whatever is reading the string just blindly reads two-byte chunks (16 bits) from the string, and whatever is in those two bytes is a Unicode integer. However, UTF-8 is a tricky encoding. It uses from 1 to 4 bytes to store a Unicode integer.

    UTF-16le and UTF-16be are variable-length encodings just like UTF-8. There are 0x110000 Unicode code points (though most aren't assigned), and that doesn't fit in 16 bits. A code point encoded in UTF-16 takes 2 or 4 bytes. For example, the UTF-16be encoding of U+10000 is the bytes D8 00 DC 00.

    UCS-2le and UCS-2be are fixed-width encodings, but they can only encode a subset of Unicode (code points zero to 0xFFFF).
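
    Both points are easy to check; a sketch using Encode's UTF-16BE and UCS-2BE encodings:

        use strict;
        use warnings;
        use Encode qw(encode FB_CROAK);

        # A BMP code point fits in one 16-bit unit; U+10000 needs a
        # surrogate pair.
        printf "%s\n", uc unpack 'H*', encode('UTF-16BE', "\x{3042}");   # 3042
        printf "%s\n", uc unpack 'H*', encode('UTF-16BE', "\x{10000}");  # D800DC00

        # UCS-2 simply cannot represent code points above 0xFFFF.
        my $ok = eval { encode('UCS-2BE', "\x{10000}", FB_CROAK); 1 };
        print $ok ? "encoded\n" : "UCS-2BE refused U+10000: $@";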