The following table describes the UTF-8 byte sequences used to represent a character, by range of character number.
| Unicode/UCS number | Byte Sequence |
|---|---|
| U+00000000-U+0000007F | 0xxxxxxx |
| U+00000080-U+000007FF | 110xxxxx 10xxxxxx |
| U+00000800-U+0000FFFF | 1110xxxx 10xxxxxx 10xxxxxx |
| U+00010000-U+001FFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
| U+00200000-U+03FFFFFF | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
| U+04000000-U+7FFFFFFF | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
The x bit positions are filled with the bits of the character's number in binary, with the rightmost x bit being the least significant. Note that in a multibyte sequence, the number of leading one bits in the first byte is identical to the total number of bytes in the sequence.
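The table can be turned into code fairly directly. What follows is a minimal sketch in Python (not Perl, and not what CGI.pm does internally); the function name `utf8_encode` and the `ROWS` list are made up for illustration. It covers all six rows of the table above, including the 5- and 6-byte forms, and does not check for surrogates or overlong input.

```python
# Each row: (upper bound of the range, number of bytes, fixed bits of the first byte)
ROWS = [
    (0x7F, 1, 0x00),        # 0xxxxxxx
    (0x7FF, 2, 0xC0),       # 110xxxxx
    (0xFFFF, 3, 0xE0),      # 1110xxxx
    (0x1FFFFF, 4, 0xF0),    # 11110xxx
    (0x3FFFFFF, 5, 0xF8),   # 111110xx
    (0x7FFFFFFF, 6, 0xFC),  # 1111110x
]


def utf8_encode(code_point: int) -> bytes:
    """Encode one character number using the byte-sequence table (hypothetical helper)."""
    if code_point < 0:
        raise ValueError("character number must be non-negative")
    for upper, length, prefix in ROWS:
        if code_point <= upper:
            break
    else:
        raise ValueError("character number does not fit in six bytes")

    out = []
    # Fill the continuation bytes from the right: six bits each, marked with 10xxxxxx.
    for _ in range(length - 1):
        out.append(0x80 | (code_point & 0x3F))
        code_point >>= 6
    # Whatever bits remain go into the x positions of the first byte.
    out.append(prefix | code_point)
    return bytes(reversed(out))


print(utf8_encode(0xF6).hex())  # c3b6
```

For character numbers that Python itself accepts, the result should agree with `chr(n).encode("utf-8")`.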
For example:
U+000000F6 (LATIN SMALL LETTER O WITH DIAERESIS, 'ö') = 1111 0110
Since 0xF6 is greater than 0x7F, UTF-8 uses the second row of the above table to encode this character:
110xxxxx 10xxxxxx = 0xC0 0x80 (the empty template)
11000011 10110110 = 0xC3 0xB6 (with the bits 00011 110110 filled in)
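The same arithmetic for this one character, written out as a small Python sketch. The variable names are only illustrative; the masks and shifts follow the two-byte row of the table, and the last step reverses the process to recover the original number.

```python
code_point = 0xF6  # 1111 0110

# Two-byte row: 110xxxxx 10xxxxxx
lead = 0xC0 | (code_point >> 6)    # upper bits go into the first byte
cont = 0x80 | (code_point & 0x3F)  # lower six bits go into the second byte

print(f"{lead:08b} {cont:08b}")      # 11000011 10110110
print(f"0x{lead:02X} 0x{cont:02X}")  # 0xC3 0xB6

# Decoding: strip the marker bits and put the pieces back together.
restored = ((lead & 0x1F) << 6) | (cont & 0x3F)
print(f"0x{restored:02X}")           # 0xF6
```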
This explains how %F6 is transcoded to %C3%B6. CGI.pm is placing single-byte characters from the ISO-8859-1 character set in place of the Unicode two-byte character, which is expected. I can also run the string through a UTF-8 decoder and it will display the proper character. However, if I display the string undecoded back to the browser in UTF-8 mode, it shows up as the wrong character (a Chinese character).
If I want to process the string in Perl and have the proper character in the string, I expect to have to decode the two bytes using a UTF-8 decoder. I would not, however, expect to have to decode the string if I were just going to turn around and display it back to the browser, which is in UTF-8 'mode'. Yet when I do decode the string, it displays in the browser properly.
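To make the %F6 versus %C3%B6 difference concrete, here is a rough Python sketch using the standard `urllib.parse` module. It is not Perl and does not reproduce CGI.pm's behavior. The last two lines show one common way an undecoded string can end up wrong on output (the two bytes are treated as two separate characters and encoded a second time). That is an assumption about the kind of problem involved, not a diagnosis of the Chinese character described above.

```python
from urllib.parse import quote, unquote_to_bytes

char = "\u00f6"  # LATIN SMALL LETTER O WITH DIAERESIS

latin1_form = quote(char, encoding="latin-1")  # one byte, percent-encoded
utf8_form = quote(char, encoding="utf-8")      # two bytes, percent-encoded
print(latin1_form, utf8_form)                  # %F6 %C3%B6

raw = unquote_to_bytes(utf8_form)              # the two raw bytes C3 B6
decoded = raw.decode("utf-8")                  # one character: U+00F6
print(len(raw), len(decoded))                  # 2 1

# Assumed failure mode: the undecoded bytes are taken as two single-byte
# characters (U+00C3 and U+00B6) and then encoded to UTF-8 again on output.
undecoded = raw.decode("latin-1")
resent = undecoded.encode("utf-8")
print(resent.hex())                            # c383c2b6
```

Decoding first and then encoding once on output gives back the original two bytes, which matches the observation that the decoded string displays properly.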
Note: My source for all this newfound UCS/Unicode knowledge was http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucs; some portions were copied and pasted, while others were paraphrased. Thanks to Markus Kuhn for his wonderful resource.
In reply to Re: UTF-8 and URL encoding by linux454, in thread UTF-8 and URL encoding by linux454