Re: How are regex character classes implemented?
by japhy (Canon) on Jul 18, 2002 at 18:50 UTC
|
Yes, Unicode fouled things up. Perl uses the utf8.c source file, and the main "is this character a member of this Unicode class" function is swash_fetch().
I don't understand it, honestly. Unicode gives me the chills. The bad chills.
_____________________________________________________
Jeff[japhy]Pinyan:
Perl,
regex,
and perl
hacker, who'd like a job (NYC-area)
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??; | [reply] |
Re: How are regex character classes implemented?
by Elian (Parson) on Jul 18, 2002 at 18:39 UTC
|
For Unicode (and a 64K bitmap's insufficient, there's more than 64K characters at this point) you can use a two or three level tree and compress away entire unused branches. There's some discussion of this in the Unicode spec, which is available online from The Unicode Consortium
Or go dig up PCRE, or IBM's ICU library. | [reply] |
|
|
The Unicode Technical Report #18 discusses things a Unicode regex engine needs to do, but doesn't give any ideas on how to implement one. Do you know where it discusses the tree approach?
| [reply] |
|
|
It's in the main Unicode book itself, though I'd have to go dig through the docs. It's not in the discussion of regular expressions--IIRC it's in there in the discussion of character properties, but it's been a while.
| [reply] |
Re: How are regex character classes implemented?
by kvale (Monsignor) on Jul 18, 2002 at 18:48 UTC
|
The ASCII character classes are handled with bitmaps.
From the 5.6.1 regcomp.c, the
UTF8 classes are handled with RVs. Check out S_regclassutf8(pTHX) with the code
if (!SIZE_ONLY) {
SV *rv = swash_init("utf8", "", listsv, 1, 0);
#ifdef DEBUGGING
AV *av = newAV();
av_push(av, rv);
av_push(av, listsv);
rv = newRV_noinc((SV*)av);
#else
SvREFCNT_dec(listsv);
#endif
n = add_data(1,"s");
PL_regcomp_rx->data->data[n] = (void*)rv;
ARG1_SET(ret, flags);
ARG2_SET(ret, n);
}
-Mark | [reply] [d/l] |
•Re: How are regex character classes implemented?
by merlyn (Sage) on Jul 18, 2002 at 21:31 UTC
|
| [reply] |
|
|
The program I'm discussing was "prototyped" in Perl to transform a native file into XML. Now that feature is being integrated into the main program, which is written in C++. I don't need general parsing/matching features, just a way to tell whether a string is a legal XML identifier.
Too bad specifications like that don't list Unicode glyph database properties, rather than all the legal characters individually!
Doing a good job of that is low priority, but interesting to me.
I think last time I looked at PCRE (if it's the same library I saw before), it didn't handle Unicode. The one you point to mentions screwed-up experimental UTF-8 features, so maybe it's evolved.
Thanks, as always.
—John
| [reply] |
Re: How are regex character classes implemented?
by seattlejohn (Deacon) on Jul 19, 2002 at 06:09 UTC
|
At the risk of engaging in premature optimization, I would be inclined to suggest tuning for the common case, i.e. 7-or 8-bit ASCII. Setting up a bitmap for that is obviously straightforward. In practical terms you probably have to go with the list-of-pairs approach for the rest of the character set, since Unicode code points extend up to 0x10FFFF and I doubt you want to maintain a (very sparse) 1.1-megabit bit map for each character class.
And yes, that's 0x10FFFF, not 0xFFFF. Unicode is not limited to representing 64K characters, and something like 74K code points have actually been defined at this point. Short-term you could probably get away with only worrying about the lower 64K code points (the "basic multilingual plane"), but it's not really a compliant Unicode implementation if you go that route. | [reply] |
|
|
Yes, I was thinking that for a general-purpose component having sets that only contain smaller ranges are easily done with polymorphism (different representations for different cases), but it should also efficiently handle when the character to test is most-often in the ASCII/ANSI subrange, too.
My favorite doodle right now is a class with 4 pointer members: small_low points to a 16-byte bitmap for 0..127, likewise small_high for 128..255. large points to an array of arrays that handle 16-bit characters, and huge points to a deeper chain of arrays of arrays for 31-bit characters.
For testing, the small_low and small_high bitmaps are one pointer away, even if the large bitmap is populated. It doesn't have to chace down multiple levels after seeing that the high byte is zero.
The previous favorite design is a list of numbers. Even entries are starting points of a range (inclusive) and odd entries are ending points (exclusive). Do a binary search in the array and see if it's part of an "on" range or an "off" range.
That is good for large collections with few distinct runs, but uses a different algorithm than the ASCII case. The first one I mentioned is the same throughout, just different tree depths.
—John
| [reply] |
|
|
| [reply] |
|
|
UTF-8 supports 2**256 codepoints?! I don't think so.
11111110 is the largest byte-count the first byte can encode, so that's followed by 7 groups of 6 bits, or 42 bits total.
The ISO-646 character set is defined on 31 bits. I guess they didn't want to worry about signed/unsigned, or perhaps wanted to leave a bit for the user? Anyway, it's certainly enough.
| [reply] |
|
|
|
|
Re: How are regex character classes implemented?
by ides (Deacon) on Jul 18, 2002 at 18:29 UTC
|
| [reply] |
| A reply falls below the community's threshold of quality. You may see it by logging in. |