John M. Dlugosz has asked for the wisdom of the Perl Monks concerning the following question:

A while back, I made some XML-related Regular Expressions for testing to see if a string is a legal XML identifier.

Easy to do in Perl, once you list all the ranges. Let the regex engine do the rest.

But now I need to do that in C++. I've been thinking of different ways to do it, balancing high speed against a compact representation.

A 64K bitmap (8K bytes) would give a simple lookup, but is rather large. A list of pairs would be compact but slower to search. I'm leaning toward a hierarchical table of bitmaps.
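For concreteness, here is a minimal sketch (not from the original post; names are illustrative) of the two simple representations being weighed against each other: a flat 64K bitmap with O(1) lookup, and a sorted list of inclusive ranges searched with a binary search:

```cpp
#include <algorithm>
#include <bitset>
#include <cstdint>
#include <utility>
#include <vector>

// Flat bitmap over the 16-bit range: O(1) lookup, 8 KB per class.
struct BitmapClass {
    std::bitset<65536> bits;
    bool contains(uint16_t c) const { return bits[c]; }
};

// Sorted, non-overlapping [first, last] pairs: compact, O(log n) lookup.
struct RangeListClass {
    std::vector<std::pair<uint32_t, uint32_t>> ranges;  // sorted by .first

    bool contains(uint32_t c) const {
        // Find the first range that starts strictly after c, then check
        // whether the range before it still covers c.
        auto it = std::upper_bound(ranges.begin(), ranges.end(),
                                   std::make_pair(c, UINT32_MAX));
        return it != ranges.begin() && c <= (--it)->second;
    }
};
```

The trade-off is exactly as described above: the bitmap pays 8 KB per class regardless of content, while the range list costs 8 bytes per run and a logarithmic search.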

So... how does Perl5 do it? In the old days, a 256-bit bitmap was a great character class implementation. But with Unicode, things change. I wonder if anyone knows what ideas were bandied about and dismissed, and why the implementors like the approach they settled on (if they do)? Will Perl6 do it the same way?

—John


Replies are listed 'Best First'.
Re: How are regex character classes implemented?
by japhy (Canon) on Jul 18, 2002 at 18:50 UTC
    Yes, Unicode fouled things up. Perl uses the utf8.c source file, and the main "is this character a member of this Unicode class" function is swash_fetch().

    I don't understand it, honestly. Unicode gives me the chills. The bad chills.

    _____________________________________________________
    Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Re: How are regex character classes implemented?
by Elian (Parson) on Jul 18, 2002 at 18:39 UTC
    For Unicode (and a 64K bitmap's insufficient; there are more than 64K characters at this point) you can use a two- or three-level tree and compress away entire unused branches. There's some discussion of this in the Unicode spec, which is available online from The Unicode Consortium.
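    A minimal sketch of that idea (my own illustration, not from the spec): a two-level table over the BMP where the high byte indexes into a pool of 256-bit blocks, and every untouched region shares the single all-zero block at index 0, so empty branches cost only a table slot:

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Two-level compressed table: top[] maps the high byte of a 16-bit code
// point to a 256-bit block; index 0 is a shared all-zero block, so unused
// branches take no extra storage.
struct TwoLevelClass {
    using Block = std::array<uint8_t, 32>;  // 256 bits
    std::vector<Block> blocks{Block{}};     // blocks[0] is the shared empty block
    std::array<uint16_t, 256> top{};        // all slots start at 0 (empty)

    void add(uint16_t c) {
        uint16_t& idx = top[c >> 8];
        if (idx == 0) {                     // first hit in this branch:
            blocks.push_back(Block{});      // allocate a private block
            idx = uint16_t(blocks.size() - 1);
        }
        blocks[idx][(c & 0xFF) >> 3] |= uint8_t(1u << (c & 7));
    }

    bool contains(uint16_t c) const {
        const Block& b = blocks[top[c >> 8]];
        return (b[(c & 0xFF) >> 3] >> (c & 7)) & 1;
    }
};
```

    A three-level variant extends the same trick past the BMP; identical populated blocks can also be shared, not just the empty one.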

    Or go dig up PCRE, or IBM's ICU library.

      The Unicode Technical Report #18 discusses things a Unicode regex engine needs to do, but doesn't give any ideas on how to implement one. Do you know where it discusses the tree approach?

        It's in the main Unicode book itself, though I'd have to go dig through the docs. It's not in the discussion of regular expressions--IIRC it's in there in the discussion of character properties, but it's been a while.
Re: How are regex character classes implemented?
by kvale (Monsignor) on Jul 18, 2002 at 18:48 UTC
    The ASCII character classes are handled with bitmaps.

    In the 5.6.1 regcomp.c, the UTF-8 classes are handled with RVs. Check out S_regclassutf8(pTHX), which contains the code

    if (!SIZE_ONLY) {
        SV *rv = swash_init("utf8", "", listsv, 1, 0);
    #ifdef DEBUGGING
        AV *av = newAV();
        av_push(av, rv);
        av_push(av, listsv);
        rv = newRV_noinc((SV*)av);
    #else
        SvREFCNT_dec(listsv);
    #endif
        n = add_data(1, "s");
        PL_regcomp_rx->data->data[n] = (void*)rv;
        ARG1_SET(ret, flags);
        ARG2_SET(ret, n);
    }
    -Mark
Re: How are regex character classes implemented?
by merlyn (Sage) on Jul 18, 2002 at 21:31 UTC
      The program I'm discussing was "prototyped" in Perl to transform a native file into XML. Now that feature is being integrated into the main program, which is written in C++. I don't need general parsing/matching features, just a way to tell whether a string is a legal XML identifier.

      Too bad specifications like that don't list Unicode character database properties, rather than enumerating all the legal characters individually!

      Doing a good job of that is low priority, but interesting to me.

      I think last time I looked at PCRE (if it's the same library I saw before), it didn't handle Unicode. The one you point to mentions screwed-up experimental UTF-8 features, so maybe it's evolved.

      Thanks, as always.

      —John

Re: How are regex character classes implemented?
by seattlejohn (Deacon) on Jul 19, 2002 at 06:09 UTC
    At the risk of engaging in premature optimization, I would be inclined to suggest tuning for the common case, i.e. 7- or 8-bit ASCII. Setting up a bitmap for that is obviously straightforward. In practical terms you probably have to go with the list-of-pairs approach for the rest of the character set, since Unicode code points extend up to 0x10FFFF and I doubt you want to maintain a (very sparse) 1.1-megabit bitmap for each character class.

    And yes, that's 0x10FFFF, not 0xFFFF. Unicode is not limited to representing 64K characters, and something like 74K code points have actually been defined at this point. Short-term you could probably get away with only worrying about the lower 64K code points (the "basic multilingual plane"), but it's not really a compliant Unicode implementation if you go that route.

      Yes, I was thinking that for a general-purpose component, sets that contain only smaller ranges are easily handled with polymorphism (different representations for different cases), but it should also efficiently handle the case where the character to test is most often in the ASCII/ANSI subrange.

      My favorite doodle right now is a class with 4 pointer members: small_low points to a 16-byte bitmap for 0..127, likewise small_high for 128..255. large points to an array of arrays that handle 16-bit characters, and huge points to a deeper chain of arrays of arrays for 31-bit characters.

      For testing, the small_low and small_high bitmaps are one pointer away, even if the large table is populated. It doesn't have to chase down multiple levels after seeing that the high byte is zero.
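      The four-pointer doodle above might look something like the following sketch (my own rendering of the description, lookup path only; the huge branch and all population code are omitted):

```cpp
#include <cstdint>

// Sketch of the four-pointer layout described above. Member names follow
// the post; everything else is illustrative.
struct CharSet {
    const uint8_t* small_low  = nullptr;   // 16-byte bitmap for 0..127
    const uint8_t* small_high = nullptr;   // 16-byte bitmap for 128..255
    const uint8_t* const* large = nullptr; // 256 pointers to 32-byte bitmaps, 0..0xFFFF
    // 'huge' (deeper tree for 31-bit characters) is not sketched here

    bool contains(uint32_t c) const {
        if (c < 0x80)       // common case: one pointer away
            return small_low && ((small_low[c >> 3] >> (c & 7)) & 1);
        if (c < 0x100)
            return small_high && ((small_high[(c - 0x80) >> 3] >> (c & 7)) & 1);
        if (c < 0x10000) {
            if (!large) return false;
            const uint8_t* page = large[c >> 8];
            return page && ((page[(c & 0xFF) >> 3] >> (c & 7)) & 1);
        }
        return false;       // 'huge' path omitted from the sketch
    }
};
```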

      The previous favorite design is a list of numbers. Even entries are starting points of a range (inclusive) and odd entries are ending points (exclusive). Do a binary search in the array and see if it's part of an "on" range or an "off" range.

      That is good for large collections with few distinct runs, but it uses a different algorithm than the ASCII case. The first one I mentioned is the same throughout, just with different tree depths.
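      That boundary-list design is compact enough to sketch in full (again my own illustration): a single sorted array of boundaries, where the parity of the binary-search result says whether the character falls in an "on" run:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Even entries start a range (inclusive), odd entries end it (exclusive).
// An odd number of boundaries <= c means c is inside an "on" run.
struct BoundaryClass {
    std::vector<uint32_t> bounds;  // sorted: start0, end0, start1, end1, ...

    bool contains(uint32_t c) const {
        auto it = std::upper_bound(bounds.begin(), bounds.end(), c);
        return (it - bounds.begin()) & 1;  // odd index => inside
    }
};
```

      Storage is one integer per boundary, so a class with few runs stays tiny no matter how large the runs are.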

      —John

      Moreover, the limitation to "only" 0x10FFFF codepoints isn't completely technical. UTF-8 can support up to 2**(2**8) codepoints, IIRC.


      Confession: It does an Immortal Body good.

        UTF-8 supports 2**256 codepoints?! I don't think so.

        11111110 is the largest byte-count the first byte can encode, so that's followed by 7 groups of 6 bits, or 42 bits total.

        The ISO 10646 character set is defined on 31 bits. I guess they didn't want to worry about signed/unsigned, or perhaps wanted to leave a bit for the user? Anyway, it's certainly enough.

Re: How are regex character classes implemented?
by ides (Deacon) on Jul 18, 2002 at 18:29 UTC
    Have you looked into just using a C/C++ regex library in your code? There are several that I have seen around.

    -----------------------------------
    Frank Wiles <frank@wiles.org>
    http://frank.wiles.org
