
Ok, after some more research I think I have a better understanding of the situation. Forgive me if I am stating the obvious, but this is for chaps like myself. UTF-8 is not a character set; it is an encoding method for the UCS/Unicode character set, whose character numbers do not fit in a single byte. ISO-8859-1 is a superset of US-ASCII and, like it, a single-byte character set. It is not an encoding method per se: each of its characters maps directly to one byte, so no magical encoding has to be done. The way UTF-8 works is thus:

UCS characters U+0000-U+007F are encoded as single bytes 0x00-0x7F, which allows for ASCII compatibility.
All UCS characters >U+007F are encoded as a sequence of several bytes, each of which has its most significant bit set.
The first byte of a multibyte sequence is always in the range 0xC0-0xFD and indicates how many bytes follow for this character. All further bytes in the same sequence are in the range 0x80-0xBF.
All possible 2³¹ UCS codes can be encoded.
The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.
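To make the byte ranges above concrete, here is a small sketch (in Python rather than Perl, purely for brevity) that sorts a byte into one of the roles just listed and reads the sequence length off a first byte. The function names are my own, not from any library. It follows the original 31-bit scheme described here; as far as I know, the current UTF-8 standard (RFC 3629) stops at U+10FFFF and four bytes, so some first bytes accepted below are not valid in practice.

```python
def classify_utf8_byte(b: int) -> str:
    """Classify one byte value (0-255) by its role in the scheme above."""
    if not 0 <= b <= 0xFF:
        raise ValueError("expected a byte value in 0..255")
    if b <= 0x7F:
        return "ascii"          # 0xxxxxxx: a complete one-byte character
    if b <= 0xBF:
        return "continuation"   # 10xxxxxx: second or later byte of a sequence
    if b <= 0xFD:
        return "lead"           # 110xxxxx .. 1111110x: first byte of a sequence
    return "unused"             # 0xFE and 0xFF never appear


def sequence_length(first: int) -> int:
    """Return the total number of bytes announced by a first byte."""
    kind = classify_utf8_byte(first)
    if kind == "ascii":
        return 1
    if kind != "lead":
        raise ValueError("not a valid first byte")
    # Count the leading one bits: 110xxxxx -> 2, 1110xxxx -> 3, and so on.
    length = 0
    mask = 0x80
    while first & mask:
        length += 1
        mask >>= 1
    return length


if __name__ == "__main__":
    for b in (0x41, 0xC3, 0xB6, 0xFE):
        print(hex(b), classify_utf8_byte(b))
```

Because the role of a byte is visible from its top bits alone, a decoder that lands in the middle of a stream can skip continuation bytes until it finds the next first byte.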

The following table describes the byte sequences used to represent a character.

Unicode/UCS number	Byte Sequence
U+00000000-U+0000007F	0xxxxxxx
U+00000080-U+000007FF	110xxxxx 10xxxxxx
U+00000800-U+0000FFFF	1110xxxx 10xxxxxx 10xxxxxx
U+00010000-U+001FFFFF	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U+00200000-U+03FFFFFF	111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U+04000000-U+7FFFFFFF	1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The x bit positions are filled with the bits of the character's number in binary. The rightmost bit is the least significant. Note that the number of leading one bits in the first byte of a multibyte sequence is identical to the total number of bytes in the sequence.
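The bit-filling rule can be sketched as a tiny hand-rolled encoder. This is only an illustration of the table, not how any real library does it; `utf8_encode` and `_ROWS` are names I made up, and the sketch deliberately covers the full 31-bit range from the table even though modern UTF-8 stops at four bytes.

```python
# One entry per table row:
# (highest character number in the row, bytes in the sequence, marker bits of the first byte)
_ROWS = [
    (0x7F,       1, 0x00),
    (0x7FF,      2, 0xC0),
    (0xFFFF,     3, 0xE0),
    (0x1FFFFF,   4, 0xF0),
    (0x3FFFFFF,  5, 0xF8),
    (0x7FFFFFFF, 6, 0xFC),
]


def utf8_encode(code_point: int) -> bytes:
    """Encode one UCS character number using the table above."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("character number out of range")
    # Pick the first row whose range contains the character number.
    for limit, length, marker in _ROWS:
        if code_point <= limit:
            break
    if length == 1:
        return bytes([code_point])
    out = []
    value = code_point
    # Fill the continuation bytes from the right, six bits at a time.
    for _ in range(length - 1):
        out.append(0x80 | (value & 0x3F))
        value >>= 6
    # Whatever bits remain go into the x positions of the first byte.
    out.append(marker | value)
    return bytes(reversed(out))


if __name__ == "__main__":
    print(utf8_encode(0xF6).hex())  # c3b6
```

For character numbers that Python itself can encode, the result can be checked against the built-in, for example `utf8_encode(0x20AC) == "\u20ac".encode("utf-8")`.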

For example, U+000000F6 (LATIN SMALL LETTER O WITH DIAERESIS, 'ö') is 1111 0110 in binary.
Since 0xF6 is greater than 0x7F, UTF-8 uses the second row of the above table to encode this character.

110XXXXX 10XXXXXX = 0xC0 0x80
11000011 10110110 = 0xC3 0xB6
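Going the other way, the same example can be undone by stripping the marker bits and gluing the x bits back together. The sketch below decodes exactly one sequence; `utf8_decode_one` is a hypothetical helper of mine, and it does not reject overlong forms or other things a strict decoder should refuse.

```python
def utf8_decode_one(data: bytes) -> int:
    """Decode a single UTF-8 sequence and return the character number."""
    if not data:
        raise ValueError("empty input")
    first = data[0]
    if first < 0x80:
        # 0xxxxxxx: the byte is the character number.
        length, value = 1, first
    else:
        if first < 0xC0 or first > 0xFD:
            raise ValueError("not a valid first byte")
        # The count of leading one bits is the sequence length.
        length = 0
        mask = 0x80
        while first & mask:
            length += 1
            mask >>= 1
        # Keep only the x bits to the right of the first zero bit.
        value = first & (mask - 1)
    if len(data) != length:
        raise ValueError("wrong number of bytes for this first byte")
    for b in data[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("expected a continuation byte (10xxxxxx)")
        # Each continuation byte contributes six more bits.
        value = (value << 6) | (b & 0x3F)
    return value


if __name__ == "__main__":
    print(hex(utf8_decode_one(b"\xc3\xb6")))  # 0xf6
```

Here 0xC3 contributes the bits 00011 and 0xB6 contributes 110110, which together give 000 1111 0110 = 0xF6.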

This explains how %F6 is transcoded to %C3%B6. CGI.pm is placing single-byte characters from the ISO-8859-1 character set in place of the Unicode two-byte character, which is expected. I can also run the string through a UTF-8 decoder and it will display the proper character. However, if I display the string undecoded back to the browser in UTF-8 mode, it shows up as the wrong character (a Chinese character). If I want to process the string in Perl and have the proper character in the string, I expect I would have to decode the two bytes using a UTF-8 decoder. I would not expect to have to decode the string if I were just going to turn around and display it back to the browser, which is in UTF-8 'mode', yet it only displays properly in the browser when I do decode it.
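The two percent-encoded forms can be reproduced with a few lines of Python's standard `urllib.parse` (again Python only for brevity; this is not what CGI.pm does internally, and it does not explain the Chinese character I am seeing). It just shows that the same 'ö' becomes %F6 or %C3%B6 depending on which encoding is applied before the bytes are percent-escaped, and what happens when the UTF-8 bytes are read back as ISO-8859-1.

```python
from urllib.parse import quote, unquote

o_umlaut = "\u00f6"  # LATIN SMALL LETTER O WITH DIAERESIS

# Same character, two different byte encodings before percent-escaping.
latin1_form = quote(o_umlaut, encoding="latin-1")
utf8_form = quote(o_umlaut, encoding="utf-8")

# Decoding the UTF-8 form with the right and the wrong assumption.
decoded_ok = unquote(utf8_form, encoding="utf-8")
decoded_wrong = unquote(utf8_form, encoding="latin-1")

if __name__ == "__main__":
    print(latin1_form)              # %F6
    print(utf8_form)                # %C3%B6
    print(decoded_ok == o_umlaut)   # True
    print(len(decoded_wrong))       # 2
```

Read as ISO-8859-1, the two bytes 0xC3 and 0xB6 come out as two separate characters (U+00C3 and U+00B6) instead of one 'ö', which is why it matters that both ends agree on the encoding.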

Note: My source for all this newfound UCS/Unicode knowledge was http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucs; some portions were copied and pasted, while others were paraphrased. Thanks to Markus Kuhn for his wonderful resource.

In reply to Re: UTF-8 and URL encoding by linux454
in thread UTF-8 and URL encoding by linux454
