in reply to Re: Re: How are regex character classes implemented?
in thread How are regex character classes implemented?

UTF-8 supports 2**256 codepoints?! I don't think so.

11111110 is the largest length marker the first byte can encode; it would head a 7-byte sequence, so that's followed by 6 groups of 6 bits, or 36 bits total.

The ISO 10646 character set is defined on 31 bits. I guess they didn't want to worry about signed/unsigned, or perhaps wanted to leave a bit for the user? Anyway, it's certainly enough.
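
To make the arithmetic concrete, here's a back-of-the-envelope Perl sketch, assuming the classic lead-byte convention (the number of leading 1 bits in the first byte gives the total sequence length, and each continuation byte carries 6 payload bits):

  # payload bits in an n-byte sequence: the lead byte contributes
  # (7 - n) bits, each of the (n - 1) continuation bytes contributes 6
  for my $n (2 .. 7) {
      my $bits = (7 - $n) + 6 * ($n - 1);
      printf "%d-byte sequence: %2d payload bits\n", $n, $bits;
  }
  # a 6-byte sequence tops out at 31 bits (the ISO 10646 space);
  # even a hypothetical 7-byte sequence led by 11111110 reaches
  # only 36 bits -- nowhere near 2**256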


Re: Re: Re: Re: How are regex character classes implemented?
by seattlejohn (Deacon) on Jul 19, 2002 at 21:17 UTC
    11111110 is the largest length marker the first byte can encode; it would head a 7-byte sequence, so that's followed by 6 groups of 6 bits, or 36 bits total.

    If I understand the Unicode spec properly, there's an important distinction between Unicode code points (what we tend to think of as characters) and Unicode encodings, e.g. UTF-8. The current version of Unicode defines "only" the code points 0 through 0x10FFFF, which they claim should be more than enough to handle every character in every modern and historical language ever written.

    There are then a variety of transformation formats defined for representing Unicode code points as actual bytes/octets (a short sketch after the list shows them side by side):

    • UTF-8: a variable-length encoding in which Unicode code points 0-127 (the ASCII chars 0-127) are represented by a single octet, and other code points are represented using from 2 to 6 octets. Used by Perl internally and also intended for places like HTML documents where reducing file size and transmission time for the common case is particularly desirable.
    • UTF-16: an encoding based on two-octet units that can represent about 63K Unicode codepoints in a single unit, including large numbers of the CJK (Chinese-Japanese-Korean) unified ideographs. Some unit values are reserved for surrogates, in which two sequential units together represent one of the codepoints larger than 0xFFFF.
    • UTF-32: a four-octet encoding scheme that represents every Unicode codepoint without any form of escaping or surrogates. This does not, however, mean that there are actually 2**32 possible Unicode codepoints -- despite having 32 bits to work with, UTF-32 values larger than 0x10FFFF are explicitly illegal.
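
    To see the three side by side, here's a quick sketch using Perl's standard Encode module (U+263A is just an arbitrary example code point):

      use Encode qw(encode);

      my $char = chr(0x263A);   # one code point: WHITE SMILING FACE

      for my $enc ('UTF-8', 'UTF-16BE', 'UTF-32BE') {
          my $octets = encode($enc, $char);
          printf "%-8s %d octets: %s\n", $enc, length($octets),
              join ' ', map { sprintf '%02X', ord } split //, $octets;
      }
      # UTF-8    3 octets: E2 98 BA
      # UTF-16BE 2 octets: 26 3A
      # UTF-32BE 4 octets: 00 00 26 3A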
      And don't forget
      • UCS-4: “each encoded character is represented in a 32-bit quantity within a code space 0..7FFFFFFF”
      Unicode defines a code space running up to 0x10FFFF, but ISO 10646 defines one up to 0x7FFFFFFF, or 31-bit values. However, the highest part of the space is already reserved for private use, and they promised to assign real codes starting from the bottom, so the smaller domain of Unicode should not be a problem until they actually run out.
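
      For scale, a Perl one-liner comparing the two spaces:

        # Unicode's range as a fraction of the 31-bit UCS space
        printf "%d of %d code points (%.2f%%)\n",
            0x110000, 0x80000000, 100 * 0x110000 / 0x80000000;
        # prints: 1114112 of 2147483648 code points (0.05%)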

      Yes, there is a big difference between the code points and the encodings. A capital 'A' is the value 65. How you store the 65 in your program is beside the point. It could be a 7-bit integer, a 64-bit integer, a floating-point number, a string of EBCDIC digits, Huffman-encoded variable-length fields, or whatever.
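
      In Perl terms (on an ASCII-based platform):

        my $cp = ord('A');                   # the abstract code point: 65
        printf "%d 0x%X 0b%b\n", $cp, $cp, $cp;   # 65 0x41 0b1000001
        # the same 65 can be stored however you like:
        my $as_byte = pack 'C', $cp;         # one unsigned octet
        my $as_quad = pack 'N', $cp;         # four octets, big-endian
        my $as_text = sprintf '%d', $cp;     # the digit string "65"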

      UTF-8 is great for the reasons you list, and for a few others: it's a strict superset of ASCII, and it's byte-order neutral.
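
      The superset property is easy to check: pure-ASCII text comes out of a UTF-8 encode byte-for-byte unchanged (again using the standard Encode module):

        use Encode qw(encode);
        my $ascii = 'plain old ASCII';
        print encode('UTF-8', $ascii) eq $ascii
            ? "identical octets\n"
            : "changed\n";      # prints "identical octets"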

      —John