drewmate has asked for the wisdom of the Perl Monks concerning the following question:

I need to convert any non-ASCII characters in a string into their escaped Unicode forms (like \u00E3) for use in a program that requires this form. I am using the Unicode::Escape module (specifically the escape function), and it was working out nicely:

$entry[5] = Unicode::Escape::escape($val);

This seemed to work for Latin languages with accents, and for Russian (Cyrillic characters), but when I tested it with Japanese characters it gave me this error:

Cannot decode string with wide characters at C:/Perl/lib/Encode.pm line 194.

I'm not sure why I am getting this error, since I'm not trying to decode anything. I'm trying to escape some Japanese characters into escaped Unicode code points. I think the problem might be that the escape routine is expecting UTF-8 and the Japanese is in UTF-16 or something. I'm not entirely sure... The text comes from an Excel document which I open with Spreadsheet::Read, and I'm not sure what that is doing with it.

Anyone have any tips on how to figure this one out?

Replies are listed 'Best First'.
Re: Cannot decode string with wide characters - I'm not decoding!
by graff (Chancellor) on May 04, 2011 at 03:11 UTC
    If you really are not sure what encoding is being used by your input data, and especially if you're dealing with any of the Asian languages, you should look at Encode::Guess.
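    A minimal sketch of that approach (the candidate encodings listed here are only guesses; adjust the suspect list to whatever your spreadsheet is likely to contain, and substitute the real cell value for the placeholder):

        use strict;
        use warnings;
        use Encode::Guess qw(euc-jp shiftjis 7bit-jis);   # candidate encodings to try

        my $raw = "...";                         # placeholder: the byte string from the spreadsheet cell
        my $decoder = Encode::Guess->guess($raw);

        if ( ref $decoder ) {
            my $chars = $decoder->decode($raw);  # now a Perl character string
            print "Guessed encoding: ", $decoder->name, "\n";
        }
        else {
            # guess() returns an error message (a plain string) when it can't decide
            warn "Could not guess encoding: $decoder\n";
        }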
Re: Cannot decode string with wide characters - I'm not decoding!
by 7stud (Deacon) on May 04, 2011 at 00:55 UTC

    I need to convert any non-Ascii characters in a string into their escaped unicode forms (like \u00E3)...

    I'm not sure why I am getting this error, since I'm not trying to decode anything.

    When you convert from UTF-8 (or UTF-16 or any other 'encoding') to Unicode, you are decoding. When you convert from Unicode to UTF-8, you are encoding. Unicode is an integer like 8634, and writing that integer in hex format (rather than decimal) does not change the fact that it is a Unicode integer. The '\u' says, "Hey, what follows is a Unicode integer in hex format." The decision about how many bytes you want to use to store that Unicode integer in a string is the decision about which encoding to use.
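
    To make that concrete, here is a small sketch (plain Encode calls, nothing specific to the OP's data) showing the bytes-versus-code-points distinction and how a \u escape is just the code point printed in hex:

        use strict;
        use warnings;
        use Encode qw(decode encode);

        my $bytes = "\xC3\xA3";                  # the UTF-8 bytes for U+00E3
        my $chars = decode('UTF-8', $bytes);     # decoding: bytes -> Unicode code points

        printf "code point: U+%04X\n", ord($chars);    # code point: U+00E3
        printf "escaped:    \\u%04X\n", ord($chars);   # escaped:    \u00E3

        my $again = encode('UTF-8', $chars);     # encoding: code points -> UTF-8 bytes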

    I think the problem might be because the Escape routine is expecting UTF8 and the Japanese is in UTF16 or something. I'm not entirely sure...

    If you don't know what encoding a string has, you can't convert it to Unicode.

    UTF-16 is very easy to parse. Whatever is reading the string just blindly reads two byte chunks (16 bits) from the string, and whatever is in those two bytes is a Unicode integer. However, UTF-8 is a tricky encoding. It uses from 1 to 4 bytes to store a Unicode integer. In order to let whatever is reading the string know how many bytes to read for each Unicode integer, UTF-8 uses special marker bits in the first byte of each sequence. UTF-16 doesn't need any special markers because every Unicode integer is stored in two bytes, so whatever is reading the string just reads two bytes at a time.

    Now suppose a string is encoded in UTF-16, but the program reading the string is expecting UTF-8. The string reader will interpret whatever bytes it sees as UTF-8 marker bits and continuation bytes, but because the string is actually encoded in UTF-16, those patterns won't line up with real character boundaries, so the reader either complains or produces garbage.

    Here is a concrete example:

    0000 0001 0001

    If you know the Unicode integer is stored in the first byte (8 bits), then you know that the Unicode integer is 0000 0001, which is 1 in decimal. However, if the Unicode integer is stored in all 12 bits shown, then the Unicode integer is 17 (= 1*16 + 1*1).

    In short, unless you tell a program what it should be looking for when reading a string (i.e., the encoding), the program can't know how many bytes to read for each Unicode integer stored in the string. Remember, a computer can only store numbers, so Unicode integers are actually codes for characters.
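
    To see the same point in Perl (the Japanese character here is just an example), compare the byte sequences the two encodings produce, and what happens if you read the UTF-16 bytes as UTF-8:

        use strict;
        use warnings;
        use Encode qw(encode decode);

        my $char = "\x{3042}";                     # HIRAGANA LETTER A, as a character string

        my $utf8  = encode('UTF-8',    $char);     # bytes: E3 81 82  (three bytes)
        my $utf16 = encode('UTF-16BE', $char);     # bytes: 30 42     (two bytes)

        printf "UTF-8:    %s\n", join ' ', map { sprintf '%02X', ord } split //, $utf8;
        printf "UTF-16BE: %s\n", join ' ', map { sprintf '%02X', ord } split //, $utf16;

        # Reading the UTF-16 bytes as if they were UTF-8 silently gives the wrong answer:
        my $wrong = decode('UTF-8', $utf16);       # 0x30 and 0x42 are valid single-byte UTF-8,
        print "$wrong\n";                          # so this prints the two characters "0B"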

      UTF-16 is very easy to parse. Whatever is reading the string just blindly reads two byte chunks (16 bits) from the string, and whatever is in those two bytes is a Unicode integer.
      What you describe here is UCS-2. See UTF-16.
        Doh. Thanks for correcting me.

      When you convert from UTF-8 (or UTF-16 or any other 'encoding') to Unicode, you are decoding. When you convert from Unicode to UTF-8, you are encoding. Unicode is an integer like 8634

      Same, but cleaned up a bit:

      When you convert from UTF-8 (or UTF-16 or any other 'encoding') to Unicode, you are decoding. When you convert from Unicode to UTF-8, you are encoding. A Unicode string consists of code points, integers like 8634.

      UTF-16 is very easy to parse. Whatever is reading the string just blindly reads two byte chunks (16 bits) from the string, and whatever is in those two bytes is a Unicode integer. However, UTF-8 is a tricky encoding. It uses from 1 to 4 bytes to store a Unicode integer.

      UTF-16LE and UTF-16BE are variable-length encodings just like UTF-8. There are 0x110000 Unicode code points (though most aren't assigned), and that doesn't fit in 16 bits. A code point in UTF-16 can take 2 or 4 bytes. For example, the UTF-16BE encoding of U+10000 is the bytes D8 00 DC 00.

      UCS-2LE and UCS-2BE are fixed-width encodings, but they can only encode a subset of Unicode (code points 0 to 0xFFFF).
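
      A quick way to verify that with Encode (just a sketch; the code point is the same example as above):

          use strict;
          use warnings;
          use Encode qw(encode);

          my $char = "\x{10000}";            # a code point outside the Basic Multilingual Plane

          my $bytes = encode('UTF-16BE', $char);
          printf "UTF-16BE: %s\n",           # prints: UTF-16BE: D8 00 DC 00
              join ' ', map { sprintf '%02X', ord } split //, $bytes;

          # UCS-2 stops at 0xFFFF, so this character has no UCS-2 representation at all;
          # encode('UCS-2BE', $char, Encode::FB_CROAK) would die rather than produce bytes.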

Re: Cannot decode string with wide characters - I'm not decoding!
by Anonymous Monk on May 04, 2011 at 00:27 UTC