John M. Dlugosz has asked for the wisdom of the Perl Monks concerning the following question:

A while back, I made some XML-related Regular Expressions for testing to see if a string is a legal XML identifier.

Easy to do in Perl, once you list all the ranges. Let the regex engine do the rest.

But now I need to do that in C++. I've been thinking of different ways to do it, balancing high speed against a compact representation.

A 64K bitmap (8K bytes) would give a simple lookup, but is rather large. A list of pairs would be compact but slower to search. I'm leaning toward a hierarchical table of bitmaps.
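For concreteness, here is a minimal sketch (not from the original post; names are illustrative) of the two simple representations being weighed against each other: a flat 64K bitmap with O(1) lookup, and a sorted list of inclusive ranges searched with a binary search:

```cpp
#include <algorithm>
#include <bitset>
#include <cstdint>
#include <utility>
#include <vector>

// Flat bitmap over the 16-bit range: O(1) lookup, 8 KB per class.
struct BitmapClass {
    std::bitset<65536> bits;
    bool contains(uint16_t c) const { return bits[c]; }
};

// Sorted, non-overlapping [first, last] pairs: compact, O(log n) lookup.
struct RangeListClass {
    std::vector<std::pair<uint32_t, uint32_t>> ranges;  // sorted by .first

    bool contains(uint32_t c) const {
        // Find the first range that starts strictly after c, then check
        // whether the range before it still covers c.
        auto it = std::upper_bound(ranges.begin(), ranges.end(),
                                   std::make_pair(c, UINT32_MAX));
        return it != ranges.begin() && c <= (--it)->second;
    }
};
```

The trade-off is exactly as described above: the bitmap pays 8 KB per class regardless of content, while the range list costs 8 bytes per run and a logarithmic search.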

So... how does Perl5 do it? In the old days, a 256-bit bitmap was a great character class implementation. But with Unicode, things change. I wonder if anyone knows what ideas were bandied about and dismissed, and why the implementors like the approach they settled on (if they do)? Will Perl6 do it the same way?

—John


Replies are listed 'Best First'.
Re: How are regex character classes implemented?
by japhy (Canon) on Jul 18, 2002 at 18:50 UTC
    Yes, Unicode fouled things up. Perl uses the utf8.c source file, and the main "is this character a member of this Unicode class" function is swash_fetch().

    I don't understand it, honestly. Unicode gives me the chills. The bad chills.

    _____________________________________________________
    Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Re: How are regex character classes implemented?
by Elian (Parson) on Jul 18, 2002 at 18:39 UTC
    For Unicode (and a 64K bitmap's insufficient; there are more than 64K characters at this point) you can use a two- or three-level tree and compress away entire unused branches. There's some discussion of this in the Unicode spec, which is available online from The Unicode Consortium.
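    A minimal sketch of that idea (my own illustration, not from the spec): a two-level table over the BMP where the high byte indexes into a pool of 256-bit blocks, and every untouched region shares the single all-zero block at index 0, so empty branches cost only a table slot:

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Two-level compressed table: top[] maps the high byte of a 16-bit code
// point to a 256-bit block; index 0 is a shared all-zero block, so unused
// branches take no extra storage.
struct TwoLevelClass {
    using Block = std::array<uint8_t, 32>;  // 256 bits
    std::vector<Block> blocks{Block{}};     // blocks[0] is the shared empty block
    std::array<uint16_t, 256> top{};        // all slots start at 0 (empty)

    void add(uint16_t c) {
        uint16_t& idx = top[c >> 8];
        if (idx == 0) {                     // first hit in this branch:
            blocks.push_back(Block{});      // allocate a private block
            idx = uint16_t(blocks.size() - 1);
        }
        blocks[idx][(c & 0xFF) >> 3] |= uint8_t(1u << (c & 7));
    }

    bool contains(uint16_t c) const {
        const Block& b = blocks[top[c >> 8]];
        return (b[(c & 0xFF) >> 3] >> (c & 7)) & 1;
    }
};
```

    A three-level variant extends the same trick past the BMP; identical populated blocks can also be shared, not just the empty one.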

    Or go dig up PCRE, or IBM's ICU library.

      The Unicode Technical Report #18 discusses things a Unicode regex engine needs to do, but doesn't give any ideas on how to implement one. Do you know where it discusses the tree approach?

        It's in the main Unicode book itself, though I'd have to go dig through the docs. It's not in the discussion of regular expressions--IIRC it's in there in the discussion of character properties, but it's been a while.
Re: How are regex character classes implemented?
by kvale (Monsignor) on Jul 18, 2002 at 18:48 UTC
    The ASCII character classes are handled with bitmaps.

    In the 5.6.1 regcomp.c, the UTF-8 classes are handled with RVs. Check out S_regclassutf8(pTHX), which contains the code

    if (!SIZE_ONLY) {
        SV *rv = swash_init("utf8", "", listsv, 1, 0);
    #ifdef DEBUGGING
        AV *av = newAV();
        av_push(av, rv);
        av_push(av, listsv);
        rv = newRV_noinc((SV*)av);
    #else
        SvREFCNT_dec(listsv);
    #endif
        n = add_data(1, "s");
        PL_regcomp_rx->data->data[n] = (void*)rv;
        ARG1_SET(ret, flags);
        ARG2_SET(ret, n);
    }
    -Mark
Re: How are regex character classes implemented?
by merlyn (Sage) on Jul 18, 2002 at 21:31 UTC
      The program I'm discussing was "prototyped" in Perl to transform a native file into XML. Now that feature is being integrated into the main program, which is written in C++. I don't need general parsing/matching features, just a way to tell whether a string is a legal XML identifier.

      Too bad specifications like that don't list Unicode character database properties, rather than enumerating all the legal characters individually!

      Doing a good job of that is low priority, but interesting to me.

      I think last time I looked at PCRE (if it's the same library I saw before), it didn't handle Unicode. The one you point to mentions screwed-up experimental UTF-8 features, so maybe it's evolved.

      Thanks, as always.

      —John

Re: How are regex character classes implemented?
by seattlejohn (Deacon) on Jul 19, 2002 at 06:09 UTC
    At the risk of engaging in premature optimization, I would be inclined to suggest tuning for the common case, i.e. 7- or 8-bit ASCII. Setting up a bitmap for that is obviously straightforward. In practical terms you probably have to go with the list-of-pairs approach for the rest of the character set, since Unicode code points extend up to 0x10FFFF and I doubt you want to maintain a (very sparse) 1.1-megabit bitmap for each character class.

    And yes, that's 0x10FFFF, not 0xFFFF. Unicode is not limited to representing 64K characters, and something like 74K code points have actually been defined at this point. Short-term you could probably get away with only worrying about the lower 64K code points (the "basic multilingual plane"), but it's not really a compliant Unicode implementation if you go that route.

      Yes, I was thinking that for a general-purpose component, sets that contain only smaller ranges are easily handled with polymorphism (different representations for different cases), but it should also efficiently handle the case where the character to test is most often in the ASCII/ANSI subrange.

      My favorite doodle right now is a class with 4 pointer members: small_low points to a 16-byte bitmap for 0..127, likewise small_high for 128..255. large points to an array of arrays that handle 16-bit characters, and huge points to a deeper chain of arrays of arrays for 31-bit characters.

      For testing, the small_low and small_high bitmaps are one pointer away, even if the large table is populated. It doesn't have to chase down multiple levels after seeing that the high byte is zero.
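      The four-pointer doodle above might look something like the following sketch (my own rendering of the description, lookup path only; the huge branch and all population code are omitted):

```cpp
#include <cstdint>

// Sketch of the four-pointer layout described above. Member names follow
// the post; everything else is illustrative.
struct CharSet {
    const uint8_t* small_low  = nullptr;   // 16-byte bitmap for 0..127
    const uint8_t* small_high = nullptr;   // 16-byte bitmap for 128..255
    const uint8_t* const* large = nullptr; // 256 pointers to 32-byte bitmaps, 0..0xFFFF
    // 'huge' (deeper tree for 31-bit characters) is not sketched here

    bool contains(uint32_t c) const {
        if (c < 0x80)       // common case: one pointer away
            return small_low && ((small_low[c >> 3] >> (c & 7)) & 1);
        if (c < 0x100)
            return small_high && ((small_high[(c - 0x80) >> 3] >> (c & 7)) & 1);
        if (c < 0x10000) {
            if (!large) return false;
            const uint8_t* page = large[c >> 8];
            return page && ((page[(c & 0xFF) >> 3] >> (c & 7)) & 1);
        }
        return false;       // 'huge' path omitted from the sketch
    }
};
```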

      The previous favorite design is a list of numbers. Even entries are starting points of a range (inclusive) and odd entries are ending points (exclusive). Do a binary search in the array and see if it's part of an "on" range or an "off" range.

      That is good for large collections with few distinct runs, but it uses a different algorithm than the ASCII case. The first one I mentioned is the same throughout, just with different tree depths.
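      That boundary-list design is compact enough to sketch in full (again my own illustration): a single sorted array of boundaries, where the parity of the binary-search result says whether the character falls in an "on" run:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Even entries start a range (inclusive), odd entries end it (exclusive).
// An odd number of boundaries <= c means c is inside an "on" run.
struct BoundaryClass {
    std::vector<uint32_t> bounds;  // sorted: start0, end0, start1, end1, ...

    bool contains(uint32_t c) const {
        auto it = std::upper_bound(bounds.begin(), bounds.end(), c);
        return (it - bounds.begin()) & 1;  // odd index => inside
    }
};
```

      Storage is one integer per boundary, so a class with few runs stays tiny no matter how large the runs are.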

      —John

      Moreover, the limitation to "only" 0x10FFFF codepoints isn't completely technical. UTF-8 can support up to 2**(2**8) codepoints, IIRC.


      Confession: It does an Immortal Body good.

        UTF-8 supports 2**256 codepoints?! I don't think so.

        11111110 is the largest byte-count the first byte can encode, so that's followed by 7 groups of 6 bits, or 42 bits total.

        The ISO 10646 character set is defined on 31 bits. I guess they didn't want to worry about signed/unsigned, or perhaps wanted to leave a bit for the user? Anyway, it's certainly enough.

Re: How are regex character classes implemented?
by ides (Deacon) on Jul 18, 2002 at 18:29 UTC
    Have you looked into just using a C/C++ regex library in your code? There are several that I have seen around.

    -----------------------------------
    Frank Wiles <frank@wiles.org>
    http://frank.wiles.org
