11111110 is the largest byte-count the first byte can encode: 7 octets in total, counting the lead byte itself. So it's followed by 6 groups of 6 bits, or 36 bits total.
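To double-check that arithmetic, here's a quick Python sketch. It assumes the usual UTF-8 pattern, where a lead byte with n leading 1 bits announces n octets in total and every continuation octet carries 6 payload bits. The function name `payload_bits` is made up for illustration and doesn't come from Perl or any library.

```python
def payload_bits(leading_ones):
    """Payload bits in a UTF-8-style sequence whose lead byte starts with
    `leading_ones` 1 bits followed by a single 0 bit (assumed layout)."""
    if not 2 <= leading_ones <= 7:
        raise ValueError("multi-octet lead bytes have 2 to 7 leading 1 bits")
    # Bits left in the lead byte after the run of 1s and the terminating 0.
    lead_payload = 8 - (leading_ones + 1)
    # The lead byte counts toward the total, so there is one fewer continuation.
    continuation_octets = leading_ones - 1
    return lead_payload + 6 * continuation_octets


for n in range(2, 8):
    print(f"{n} octets: {payload_bits(n)} payload bits")
```

The last line printed is the 11111110 case: no payload bits in the lead byte and 6 continuation octets.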

If I understand the Unicode spec properly, there's an important distinction between Unicode code points (what we tend to think of as characters) and Unicode encodings, e.g. UTF-8. The current version of Unicode defines "only" 0x110000 code points (0 through 0x10FFFF), which they claim should be more than enough to handle every character in every modern and historical language ever written.

There are then a variety of transformation formats defined for representing Unicode code points as actual bytes/octets:

UTF-8: a variable-length encoding in which Unicode code points 0-127 (the same values as ASCII chars 0-127) are represented by a single octet, and other code points are represented using from 2 to 6 octets in the original design (2 to 4 octets are enough for anything up to 0x10FFFF). Used by Perl internally and also intended for places like HTML documents where reducing file size and transmission time for the common case is particularly desirable.
UTF-16: an encoding built on two-octet (16-bit) units that can directly represent about 63K Unicode code points, including large numbers of the CJK (Chinese-Japanese-Korean) unified ideograms. The 16-bit values 0xD800-0xDFFF are reserved for surrogate pairs, in which two sequential 16-bit units together represent one code point larger than 0xFFFF.
UTF-32: a four-octet encoding scheme that represents every Unicode codepoint without any form of escaping or surrogates. This does not, however, mean that there are actually 2^32 possible Unicode codepoints -- despite having 32 bits to work with, UTF-32 values larger than 0x10FFFF are explicitly illegal. (See here.)
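For anyone who wants to see the three formats side by side, here's a small Python sketch (Python rather than Perl, purely for illustration). It encodes a few sample code points with the standard codecs and works out a UTF-16 surrogate pair by hand. The helper name `surrogate_pair` and the sample code points are my own choices, not anything from the spec.

```python
def surrogate_pair(cp):
    """Split a code point above 0xFFFF into its UTF-16 high/low surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points 0x10000..0x10FFFF need surrogates")
    offset = cp - 0x10000              # 20 bits remain after the subtraction
    high = 0xD800 + (offset >> 10)     # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go in the low surrogate
    return high, low


# Octet counts for the same code point in each transformation format.
for cp in (0x41, 0xE9, 0x4E2D, 0x1F600):
    ch = chr(cp)
    print(f"U+{cp:04X}",
          "UTF-8:", len(ch.encode("utf-8")),
          "UTF-16:", len(ch.encode("utf-16-be")),
          "UTF-32:", len(ch.encode("utf-32-be")))

high, low = surrogate_pair(0x1F600)
print(f"U+1F600 as a surrogate pair: {high:04X} {low:04X}")
```

Python's `chr()` also refuses anything above 0x10FFFF with a `ValueError`, which matches the point that having 32 bits available doesn't make the extra values legal.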

In reply to Re: Re: Re: Re: How are regex character classes implemented? by seattlejohn
in thread How are regex character classes implemented? by John M. Dlugosz