in reply to Re: Re: Re: How are regex character classes implemented?
in thread How are regex character classes implemented?

11111110 is the largest byte count the first byte can encode: seven bytes in total, lead byte included. It carries no payload bits itself, so it's followed by 6 groups of 6 bits, or 36 bits total.
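
The lead-byte arithmetic is easy to check with a short sketch. It assumes the generalized UTF-8 pattern, in which the number of leading 1 bits in the first byte is the total length of the sequence (lead byte included) and every continuation byte carries six payload bits. The function name `payload_bits` is made up for illustration and is not from any library.

```python
def payload_bits(lead_byte):
    """Return (total_bytes, payload_bits) implied by a UTF-8-style lead byte.

    Assumes the generalized pattern: N leading 1 bits, then a 0 bit,
    means an N-byte sequence with 6 payload bits per continuation byte.
    """
    if not 0 <= lead_byte <= 0xFF:
        raise ValueError("not a byte value")
    if lead_byte < 0x80:
        return 1, 7  # plain ASCII: 0xxxxxxx

    # Count the run of 1 bits starting at the most significant bit.
    ones = 0
    mask = 0x80
    while mask and lead_byte & mask:
        ones += 1
        mask >>= 1

    if ones == 1:
        raise ValueError("10xxxxxx is a continuation byte, not a lead byte")
    if ones == 8:
        raise ValueError("11111111 has no terminating 0 bit in this pattern")

    lead_payload = 7 - ones            # bits left after the 1s and the 0
    continuation = 6 * (ones - 1)      # six bits per following byte
    return ones, lead_payload + continuation


for lead in (0xC0, 0xE0, 0xF0, 0xF8, 0xFC, 0xFE):
    total, bits = payload_bits(lead)
    print(f"{lead:08b}: {total} bytes, {bits} payload bits")
```

By this count, the four-byte form (11110xxx) has 21 payload bits, which is already enough to cover everything up to 0x10FFFF.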

If I understand the Unicode spec properly, there's an important distinction between Unicode code points (what we tend to think of as characters) and Unicode encodings, e.g. UTF-8. The current version of Unicode limits the code space to "only" 0 through 0x10FFFF, a little over a million code points or possible characters, which they claim should be more than enough to handle every character in every modern and historical language ever written.

There are then a variety of transformation formats defined for representing Unicode code points as actual bytes/octets:

Re: Re: Re: Re: Re: How are regex character classes implemented?
by John M. Dlugosz (Monsignor) on Jul 22, 2002 at 18:23 UTC
    And don't forget
    • UCS-4: “each encoded character is represented in a 32-bit quantity within a code space 0..7FFFFFFF”
    Unicode defines a code space of 0 through 0x10FFFF, but ISO 10646 defines a space of 0 through 0x7FFFFFFF, i.e. 31-bit values. However, the highest plane is already reserved for private use, and they promised to assign real codes starting from the bottom, so the smaller domain of Unicode should not be a problem until they actually run out.

    Yes, there is a big difference between the code points and the encodings. A capital 'A' is the value 65. How you store the 65 in your program is beside the point. It could be a 7-bit integer, a 64-bit integer, a floating-point number, a string of EBCDIC digits, Huffman-encoded variable-length fields, or whatever.
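
    To make that concrete, here is a rough Python sketch that stores the value 65 three different ways and gets the same number back each time. The choice of representations, and of the `cp037` codec as the EBCDIC flavor, is just an assumption for illustration.

```python
import struct

code_point = 65  # capital 'A'

# Three unrelated ways of storing the same number.
as_u64 = struct.pack(">Q", code_point)               # 64-bit big-endian integer
as_double = struct.pack(">d", float(code_point))     # IEEE 754 double
as_ebcdic_digits = str(code_point).encode("cp037")   # the digits "65" in EBCDIC

# Reading each representation back yields the same code point.
recovered = [
    struct.unpack(">Q", as_u64)[0],
    int(struct.unpack(">d", as_double)[0]),
    int(as_ebcdic_digits.decode("cp037")),
]

for value in recovered:
    print(value, chr(value))
```

    The bytes on disk differ in every case, but the character is the same because the code point is the same.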

    UTF-8 is great for the reasons you list, and for a few others: it's a strict superset of ASCII, and it's byte-order neutral.
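
    Both properties are easy to see with Python's built-in codecs. This is only a quick sketch, and the sample strings are arbitrary:

```python
# 1. Superset of ASCII: pure-ASCII text has identical bytes in both encodings.
ascii_text = "regex [a-z]+"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# 2. Byte-order neutral: UTF-8 is defined as a sequence of single bytes,
#    so there is only one serialization. UTF-16 comes in two byte orders.
text = "A\u00e9\u20ac"  # 'A', e-acute, euro sign
utf8 = text.encode("utf-8")
utf16_be = text.encode("utf-16-be")
utf16_le = text.encode("utf-16-le")

print(same_bytes)
print(utf8.hex(" "))
print(utf16_be.hex(" "))
print(utf16_le.hex(" "))
```

    The two UTF-16 outputs hold the same code units with their bytes swapped, which is why UTF-16 needs a byte-order mark or an out-of-band agreement and UTF-8 does not.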

    —John