Re: Re: Re: Re: Re: How are regex character classes implemented?

And don't forget

UCS-4: “each encoded character is represented in a 32-bit quantity within a code space 0..7FFFFFFF”

Unicode defines a code space of 0..0x10FFFF, but ISO 10646 defines a space of 0..0x7FFFFFFF, i.e. 31-bit values. However, the highest plane is already for private use, and they promised to assign real codes starting from the bottom, so the smaller domain of Unicode should not be a problem until they actually run out.

Yes, there is a big difference between the code points and the encodings. A capital 'A' is the value 65. How you store the 65 in your program is beside the point. It could be a 7-bit integer, a 64-bit integer, a floating-point number, a string of EBCDIC digits, Huffman-encoded variable-length fields, or whatever.
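To make that concrete, here is a small Python sketch (my own illustration, not anything from the regex engine under discussion). It stores the code point 65 under several encodings and checks that decoding always gives the same number back. The choice of encodings is arbitrary; cp037 is one of Python's EBCDIC code pages.

```python
# The code point for 'A' is just the integer 65.
cp = ord("A")

# Several ways to store that same number as bytes.
# cp037 is an EBCDIC code page; the others are ASCII and Unicode encodings.
encodings = ["ascii", "utf-8", "utf-16-be", "utf-32-le", "cp037"]
stored = {name: chr(cp).encode(name) for name in encodings}

# The bytes differ from one encoding to the next...
for name, data in stored.items():
    print(f"{name:10} {data.hex()}")

# ...but decoding each one recovers the same code point.
recovered = {name: ord(data.decode(name)) for name, data in stored.items()}
```

The stored bytes range from one byte to four, and the EBCDIC byte is not even 0x41, yet every entry in recovered is 65.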

UTF-8 is great for the reasons you list, and for a few others: it's a strict superset of ASCII, and it's byte-order neutral.
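Both of those properties are easy to check. A quick Python sketch, with sample strings I picked purely for illustration:

```python
# Pure ASCII text has exactly the same bytes when encoded as UTF-8.
text = "regex"
ascii_is_utf8 = text.encode("ascii") == text.encode("utf-8")

# A non-ASCII character: U+00E9 (e with acute accent).
ch = "\u00e9"

# UTF-8 is defined as a byte sequence, so there is only one possible order.
utf8 = ch.encode("utf-8")

# UTF-16 uses 16-bit units, so the bytes depend on the chosen endianness.
utf16_be = ch.encode("utf-16-be")
utf16_le = ch.encode("utf-16-le")

print(ascii_is_utf8, utf8.hex(), utf16_be.hex(), utf16_le.hex())
```

The two UTF-16 results are byte-swapped versions of each other, which is why UTF-16 needs a byte-order mark or an out-of-band agreement and UTF-8 does not.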

—John
