I need to convert any non-Ascii characters in a string into their escaped unicode forms (like \u00E3)...

I'm not sure why I am getting this error, since I'm not trying to decode anything.

When you convert from UTF-8 (or UTF-16 or any other encoding) to Unicode, you are decoding. When you convert from Unicode to UTF-8, you are encoding. A Unicode code point is an integer like 8634, and writing that integer in hex (rather than decimal) does not change the fact that it is a Unicode integer. The '\u' says, "Hey, what follows is a Unicode integer in hex format." The decision about how many bytes to use to store that Unicode integer in a string is the decision about which encoding to use.
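
To make the decode / escape / encode distinction concrete, here is a minimal sketch in Python (used purely for illustration; Perl's Encode module has decode and encode functions that play the same roles). The helper name escape_non_ascii and the sample bytes are made up for this example, and the input is assumed to be UTF-8.

```python
def escape_non_ascii(text: str) -> str:
    """Replace every non-ASCII character with a hex escape of its code point.

    Hypothetical helper for illustration only.
    """
    parts = []
    for ch in text:
        cp = ord(ch)  # the Unicode integer for this character
        if cp < 128:
            parts.append(ch)                # plain ASCII stays as-is
        elif cp <= 0xFFFF:
            parts.append(f"\\u{cp:04X}")    # four hex digits
        else:
            parts.append(f"\\U{cp:08X}")    # eight hex digits for larger code points
    return "".join(parts)


raw = b"S\xc3\xa3o"                # bytes, assumed to be UTF-8
text = raw.decode("utf-8")         # decoding: bytes -> Unicode characters
escaped = escape_non_ascii(text)   # works on characters, not bytes
encoded = text.encode("utf-8")     # encoding: Unicode characters -> bytes

print(escaped)  # S\u00E3o
```

Note that the escaping step only makes sense after decoding: it needs the Unicode integers, not the raw bytes.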

I think the problem might be because the Escape routine is expecting UTF8 and the Japanese is in UTF16 or something. I'm not entirely sure...

If you don't know what encoding a string has, you can't convert it to Unicode.

UTF-16 is fairly easy to parse. Whatever is reading the string reads two-byte chunks (16 bits) from the string, and for most characters those two bytes are the Unicode integer. (Characters above U+FFFF take two such chunks, called a surrogate pair.) UTF-8 is a trickier encoding. It uses from 1 to 4 bytes to store a Unicode integer. In order to let whatever is reading the string know how many bytes to read for each Unicode integer, UTF-8 puts special marker bits at the start of each byte: the high bits of the first byte say how many bytes are in the sequence, and every following byte in the sequence begins with the bits 10.
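
As a quick check on how many bytes each encoding actually uses, here is a small Python sketch (an illustration, not anything from the original thread). The function names and sample characters are chosen for this example; "utf-16-be" is used so that no byte order mark is added to the output.

```python
def encoded_lengths(ch: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


def utf8_sequence_length(lead: int) -> int:
    """Read the sequence length from the high bits of a UTF-8 lead byte.

    Simplified: it only inspects the bit pattern and does not reject
    every lead byte that real UTF-8 forbids.
    """
    if lead < 0x80:           # 0xxxxxxx: single byte (ASCII)
        return 1
    if 0xC0 <= lead < 0xE0:   # 110xxxxx: two-byte sequence
        return 2
    if 0xE0 <= lead < 0xF0:   # 1110xxxx: three-byte sequence
        return 3
    if 0xF0 <= lead < 0xF8:   # 11110xxx: four-byte sequence
        return 4
    raise ValueError("not a UTF-8 lead byte")


# chr(8634) is the code point 8634 mentioned earlier (U+21BA).
samples = ["A", "\u00e3", chr(8634), "\U0001F600"]
for ch in samples:
    utf8_len, utf16_len = encoded_lengths(ch)
    lead = ch.encode("utf-8")[0]
    print(f"U+{ord(ch):04X}: utf-8={utf8_len} utf-16={utf16_len} "
          f"lead byte says {utf8_sequence_length(lead)}")
```

The last sample is above U+FFFF, so it needs four bytes in both encodings.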

Now suppose a string is encoded in UTF-16, but the program reading the string is expecting UTF-8. The string reader will look at each byte for the UTF-8 marker bits to decide how many bytes belong to the current Unicode integer. But because the string is encoded in UTF-16, the bytes don't follow that pattern: the reader either hits a byte sequence that is not valid UTF-8 and fails, or it silently produces the wrong characters.
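
The mismatch is easy to reproduce. The following Python sketch (illustrative only; try_decode is a made-up helper) encodes two strings as UTF-16 and then reads the bytes back as if they were UTF-8.

```python
from typing import Optional


def try_decode(data: bytes, encoding: str) -> Optional[str]:
    """Decode data with the given encoding; return None if the bytes are invalid."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


japanese = "\u65e5\u672c\u8a9e".encode("utf-16-le")  # three Japanese characters
ascii_text = "Hi".encode("utf-16-le")                # b"H\x00i\x00"

# Wrong encoding, case 1: the bytes are not valid UTF-8, so decoding fails.
print(try_decode(japanese, "utf-8"))            # None

# Wrong encoding, case 2: decoding "succeeds" but yields stray NUL characters.
print(repr(try_decode(ascii_text, "utf-8")))    # 'H\x00i\x00'

# Right encoding: the original text comes back.
print(try_decode(japanese, "utf-16-le") == "\u65e5\u672c\u8a9e")  # True
```

The second case is the more dangerous one, because nothing reports an error.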

Here is a concrete example:

0000 0001 0001

If you know the Unicode integer is stored in the first byte (8 bits), then you know that the Unicode integer is 0000 0001, which is 1 in decimal. However, if the Unicode integer is stored in all 12 bits, then the Unicode integer is 17 (= 1*16 + 1*1).
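
The arithmetic for the bit string above can be checked with a couple of lines of Python (a throwaway sketch; read_bits is a hypothetical helper):

```python
def read_bits(bits: str, width: int) -> int:
    """Interpret the first `width` bits of a bit string as an unsigned integer."""
    cleaned = bits.replace(" ", "")
    return int(cleaned[:width], 2)


bits = "0000 0001 0001"
first_byte = read_bits(bits, 8)    # only the first 8 bits
all_bits = read_bits(bits, 12)     # all 12 bits

print(first_byte, all_bits)  # 1 17
```

Same bits, different answers, depending only on how many bits the reader was told to consume.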

In short, unless you tell a program what it should be looking for when reading a string (= the encoding), the program can't know how many bytes to read for each Unicode integer stored in the string. Remember, a computer can only store numbers, so Unicode integers are actually codes for characters.


In reply to Re: Cannot decode string with wide characters - I'm not decoding! by 7stud
in thread Cannot decode string with wide characters - I'm not decoding! by drewmate