65 is a number. Whether you represent it as the string "65" (which is {0x36, 0x35, 0} as a C string literal), a two's-complement integer in big-endian or little-endian byte order, an 8-byte floating-point number, a BCD class, or whatever, it's still the number sixty-five.
That is, the value is distinct from the representation.
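To make that concrete, here is a small Python sketch (my own illustration, not code from this thread) that stores sixty-five in several of the representations listed above and reads each one back. The 4-byte integer width and the one-byte packed BCD layout are assumptions chosen for the example.

```python
import struct

value = 65

# The decimal digits "6" and "5" followed by a terminating zero byte,
# i.e. what a C string literal "65" holds.
as_c_string = str(value).encode("ascii") + b"\x00"

# Two's-complement integers; the 4-byte width is an assumption.
as_big_endian = value.to_bytes(4, "big", signed=True)
as_little_endian = value.to_bytes(4, "little", signed=True)

# 8-byte IEEE 754 floating point, big-endian byte order.
as_double = struct.pack(">d", float(value))

# Packed BCD: one decimal digit per nibble (assumed layout).
as_bcd = bytes([(6 << 4) | 5])


def decode_all():
    """Read every representation back into a plain integer."""
    return (
        int(as_c_string.rstrip(b"\x00").decode("ascii")),
        int.from_bytes(as_big_endian, "big", signed=True),
        int.from_bytes(as_little_endian, "little", signed=True),
        int(struct.unpack(">d", as_double)[0]),
        (as_bcd[0] >> 4) * 10 + (as_bcd[0] & 0x0F),
    )


if __name__ == "__main__":
    for name, raw in [("C string", as_c_string), ("big-endian", as_big_endian),
                      ("little-endian", as_little_endian),
                      ("double", as_double), ("BCD", as_bcd)]:
        print(f"{name:14} {raw.hex(' ')}")
    print(decode_all())
```

The byte patterns all differ, yet every one of them decodes to the same value.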
UTF-8 is a representation. It explains how to take integers in the range 0..2^31-1 and encode them in a stream of bytes for interchange.
The "code point" is 65 for capital A. Whether you want to store that as a byte, a nul-terminated string of 16-bit characters, a floating-point value, or whatever, it still means the letter A to anyone who can read the representation.
There is no lookup table in that code, as you noted, because it is changing representations only, not mapping any code points.
UTF-8 is a variable-length coding system. For values that fit into 7 bits, leave the high bit 0 and emit one byte. So, all legal ASCII is also legal UTF-8! That's a designed-in feature.
For values that fit in 11 bits, emit 110xxxxx 10yyyyyy as two bytes, where the original number in binary is xxxxxyyyyyy (five x bits followed by six y bits). A subset of that gives you 110000xx for the first byte if the value fits in 8 bits. That's what your code is looking for.
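The one-byte and two-byte rules can be sketched in a few lines of Python. This is only an illustration of the bit shuffling described above; the function name `encode_utf8_short` is made up for the example, and longer sequences are deliberately left out.

```python
def encode_utf8_short(code_point: int) -> bytes:
    """Encode a code point that fits in 11 bits as one or two UTF-8 bytes."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point < 0x80:
        # Fits in 7 bits: the high bit stays 0 and one byte is emitted.
        return bytes([code_point])
    if code_point < 0x800:
        # Fits in 11 bits: 110xxxxx 10yyyyyy.
        first = 0b11000000 | (code_point >> 6)            # top five bits
        second = 0b10000000 | (code_point & 0b00111111)   # low six bits
        return bytes([first, second])
    raise ValueError("longer sequences are not covered by this sketch")


if __name__ == "__main__":
    for cp in (65, 0xE9, 0x7FF):
        print(hex(cp), encode_utf8_short(cp).hex(" "))
```

No table is involved, only shifts and masks. For any value from 0x80 to 0xFF the first byte comes out as 0xC2 or 0xC3, which is the 110000xx pattern mentioned above.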
—John
P.S. how about joining the forum?
In reply to Re: Re: Re: Re: Re: regex for utf-8
by John M. Dlugosz
in thread regex for utf-8
by jjohhn