OK, after some more research I think I understand the situation better. Forgive me if I am stating the obvious, but this is for chaps like myself. UTF-8 is not a character set; it is an encoding method for the UCS/Unicode character set, whose character numbers do not fit in a single byte. ISO-8859-1 is a superset of US-ASCII and, like it, a single-byte character set rather than an encoding method per se: each character maps directly to one byte, so no magical encoding has to be done. The way UTF-8 works is thus:

The following table describes the byte sequences used to represent a character.
Unicode/UCS number         Byte sequence
U+00000000-U+0000007F      0xxxxxxx
U+00000080-U+000007FF      110xxxxx 10xxxxxx
U+00000800-U+0000FFFF      1110xxxx 10xxxxxx 10xxxxxx
U+00010000-U+001FFFFF      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U+00200000-U+03FFFFFF      111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U+04000000-U+7FFFFFFF      1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The x bit positions are filled with the bits of the character's number in binary. The rightmost bit is the least significant. Note that for multi-byte sequences the number of leading one bits in the first byte is identical to the total number of bytes in the sequence.
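To make the leading-bits rule concrete, here is a minimal Python sketch that reads the sequence length off a first byte. The function name `sequence_length` is my own invention for illustration and is not part of any library.

```python
def sequence_length(lead_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence has, judging by its first byte.

    Hypothetical helper that follows the table above, including the
    5- and 6-byte rows.
    """
    if not 0 <= lead_byte <= 0xFF:
        raise ValueError("a byte must be in the range 0..255")
    if lead_byte < 0x80:
        return 1  # 0xxxxxxx: a single byte, no leading one bits

    # Count the leading one bits, starting from the most significant bit.
    count = 0
    mask = 0x80
    while lead_byte & mask:
        count += 1
        mask >>= 1

    # Exactly one leading one bit (10xxxxxx) marks a continuation byte,
    # and 7 or 8 leading one bits match no row of the table.
    if count == 1 or count > 6:
        raise ValueError(f"0x{lead_byte:02X} cannot start a sequence")
    return count
```

For 0xC3 (1100 0011) this counts two leading one bits, so the sequence is two bytes long.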

For example, U+000000F6 (LATIN SMALL LETTER O WITH DIAERESIS, 'ö') is 1111 0110 in binary.
Since 0xF6 is greater than 0x7F, UTF-8 uses the second row of the above table to encode this character.

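110xxxxx 10xxxxxx = 0xC0 0x80 (the empty template, with every x bit set to zero)
11000011 10110110 = 0xC3 0xB6 (the template filled with 00011 110110, which is 0xF6 padded to 11 bits)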
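The same bit shuffling can be written out by hand. This is only a sketch of the table above in Python, not how any real library does it; `utf8_encode` is a name I made up. It keeps the 5- and 6-byte rows from the table, although the current UTF-8 definition (RFC 3629) stops at U+10FFFF, so Python's own encoder will refuse anything above that.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one UCS character number by following the table row by row."""
    if code_point < 0:
        raise ValueError("character numbers are non-negative")
    if code_point <= 0x7F:
        return bytes([code_point])  # row 1: 0xxxxxxx

    # (upper limit of the row, total bytes, marker bits of the first byte)
    rows = [
        (0x7FF, 2, 0xC0),       # 110xxxxx
        (0xFFFF, 3, 0xE0),      # 1110xxxx
        (0x1FFFFF, 4, 0xF0),    # 11110xxx
        (0x3FFFFFF, 5, 0xF8),   # 111110xx
        (0x7FFFFFFF, 6, 0xFC),  # 1111110x
    ]
    for limit, length, prefix in rows:
        if code_point <= limit:
            break
    else:
        raise ValueError("character number does not fit in 31 bits")

    # Peel off six bits at a time, least significant first; each group
    # becomes a 10xxxxxx continuation byte.
    out = []
    for _ in range(length - 1):
        out.append(0x80 | (code_point & 0x3F))
        code_point >>= 6
    out.append(prefix | code_point)  # remaining bits go in the first byte
    return bytes(reversed(out))


encoded = utf8_encode(0xF6)
print(encoded.hex())  # c3b6
```

The result agrees with Python's built-in encoder: `"ö".encode("utf-8")` also gives the bytes C3 B6.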

This explains how %F6 is transcoded to %C3%B6. CGI.pm hands me the two UTF-8 bytes as two single-byte characters from the ISO-8859-1 character set rather than as the one Unicode character they encode, which is expected. I can run the string through a UTF-8 decoder and it will display the proper character. However, if I display the undecoded string back to the browser in UTF-8 mode, it shows up as the wrong character (a Chinese character). If I want to process the string in Perl and have the proper character in it, I expect to have to decode the two bytes with a UTF-8 decoder. I would not expect to have to decode the string if I were just going to turn around and display it back to a browser that is in UTF-8 'mode', yet it only displays properly in the browser once I decode it.
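For anyone who wants to see the bytes involved, here is a small Python sketch using the standard `urllib.parse` module. It shows the two URL encodings of 'ö' and one common way an undecoded string goes wrong, namely being treated as ISO-8859-1 characters and then encoded to UTF-8 a second time. I am only assuming that something like this is in play; it does not by itself account for the Chinese character I saw.

```python
from urllib.parse import quote, unquote_to_bytes

# The same character, URL-encoded from two different byte encodings.
as_latin1 = quote("ö", encoding="latin-1")  # '%F6'
as_utf8 = quote("ö", encoding="utf-8")      # '%C3%B6'

# Percent-decoding yields raw bytes, not characters.
raw = unquote_to_bytes("%C3%B6")            # b'\xc3\xb6'

# Decoding those bytes as UTF-8 recovers the single character.
text = raw.decode("utf-8")                  # 'ö'

# Treating each byte as an ISO-8859-1 character gives two characters,
# and encoding those as UTF-8 again gives four bytes (double encoding).
misread = raw.decode("latin-1")             # 'Ã¶'
double_encoded = misread.encode("utf-8")    # b'\xc3\x83\xc2\xb6'

print(as_latin1, as_utf8, quote(double_encoded))
```

The point is that the two bytes only mean 'ö' once something has declared them to be UTF-8; until then they are just bytes, and whatever sits between the script and the browser is free to re-encode them.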

Note: My source for all this newfound UCS/Unicode knowledge was http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucs; some portions were copied and pasted, while others were paraphrased. Thanks to Markus Kuhn for his wonderful resource.


In reply to Re: UTF-8 and URL encoding by linux454
in thread UTF-8 and URL encoding by linux454
