(Ovid) Re: Regexp Speed Concerns
by Ovid (Cardinal) on Jan 12, 2001 at 10:42 UTC
|
As has been commented previously, you should read Mastering Regular Expressions. It is the definitive work on regexes and has no peer. Unfortunately, it is rather out of date, but the author, Jeffrey Friedl is writing a third edition which I understand should be out shortly.
If you are familiar with regular expressions, don't make the mistake of thinking that all regular expression engines are the same. They fall into three general categories:
- DFA. Fast, but does not allow back references (e.g. capture to $1). I understand that someone is writing a DFA engine that can do backreferences, but I'm too lazy to look up the reference right now :)
- Traditional NFA. This is the regex engine that Perl uses and, amongst "those in the know", is often considered the best. It's fairly fast and allows backreferences.
- POSIX NFA. Every once in a while, a standards body comes out with a poor standard. This is one of them. POSIX NFA usually tries for the "longest" match and, as such, can take an incredible amount of time with some regular expressions. Traditional NFA usually goes for the first match, thus allowing greater speed.
Between POSIX and traditional NFA engines, both can take long time verifying a failed match (sometimes longer than the than the length of time the universe has existed!), but traditional NFA tends to be much faster on successful matches.
Optimizing regular expressions, however, takes a lot of work and is far beyond the scope of this post. One thing you can do is read my shameless plug. However, the issues raised in that post aren't terribly relevant to what you're doing here.
The issue you are having is the following: when dealing with choices for single character, use a character class, but don't alternate!!! The following seem equivalent, but they're not:
$string =~ /[abc]/; # character class
$string =~ /a|b|c/; # alternation
It is almost guaranteed that the first will run faster. The reason for this is simple: when dealing with a character class, Perl's regex engine kind of creates a "sieve" that the characters pass through. If it doesn't go through the "abc" sieve, it doesn't pass. When you alternate, as in the second example, Perl's regex engine attempts to match the 'a'. If that fails, backs up to the end of the last successful match and tries the 'b'. If that fails, it again notes the end of the last successful match and tries 'c'. In short, it's forced to mark its position and continuously retry the match, but the character class simply allows 'a', 'b', or 'c' to pass.
The alternation problem is worsened when dealing with multiple characters.
$string =~ /bob/ || /alice/; # Good
$string =~ /(?:bob|alice)/; # Bad
The first example will simply try to match the target strings. The second example will keep trying to match, move forward a character after the failure, try to match again, etc, etc (no, this isn't quite accurate, but I'm trying to be brief). This is almost guaranteed to be slower. However, these speed issues are more of a concern when iterating over data. If it's a single shot regex, it's usually not that big of a consideration.
Regardless of all of the drivel here, it's important to take into consideration the actual data to be used. Testing against hypothetical data that really doesn't match what you are going to use is a poor test of the efficiency of your solution (this is just an aside. I have no idea if it's directly applicable to what you're doing). However, if you create sample data that closely mimics what you want, benchmarking your data is much more likely to produce relevant results.
Cheers,
Ovid
Join the Perlmonks Setiathome Group or just click on the the link and check out our stats. | [reply] [d/l] [select] |
Re: Regexp Speed Concerns
by tilly (Archbishop) on Jan 12, 2001 at 08:02 UTC
|
| [reply] |
Re: Regexp Speed Concerns
by extremely (Priest) on Jan 12, 2001 at 07:40 UTC
|
| [reply] |
Re: Regexp Speed Concerns
by Hrunting (Pilgrim) on Jan 12, 2001 at 07:50 UTC
|
I've never looked at the code that perl uses to handle regexes, both compiling and optimizing, but from just looking at it and talking through it, I can begin to see why you get different results:
1: look for at, not found, look for bt, not found, look for ct ...
2: look for [abcdef], found, look for t, not found, look for ght
3: open grouping, look for [abcdef], not found, look for gh, close gro
+uping, found, look for t
4: open grouping, look for a, not found, look for b, not found, look f
+or c, ... look for gh, close grouping, found, look for t.
Just from speaking it out I can see that 4 looks like the slowest and 2 looks faster than 3. I too was puzzled by 0 until I look at it from the view that the compiler probably doesn't compile the /bt/ regex unless /at/ fails, so if the word has an 'at' in it, none of the other regexes will fail, and even compiling two or three of those regexes may be faster than compiling and processing a much larger, more complex one. That has more to do with the rules of how the boolean operators work rather than how regexes are optimized.
But again, I have absolutely no clue how it's done under the hood. That's just me looking at it and talking to myself. | [reply] [d/l] |
Re: Regexp Speed Concerns
by chipmunk (Parson) on Jan 12, 2001 at 20:44 UTC
|
Want to see how a regular expression is compiled? Perl can tell you!
perldebug:
Debugging regular expressions
There are two ways to enable debugging output for regular
expressions.
If your perl is compiled with -DDEBUGGING, you may use the
-Dr flag on the command line.
Otherwise, one can use re 'debug', which has effects both
at compile time, and at run time (and is not lexically
scoped).
[snip; read the details in the perldebug documentation.]
The really nice part is that, unlike most of the low-level debugging features, you don't need a DEBUGGING build of Perl to debug your regular expressions.
#!perl -c
use re 'debug';
/at/||/bt/||/ct/||/dt/||/et/||/ft/||/ght/;
/at|bt|ct|dt|et|ft|ght/;
/[abcdef]t|ght/;
/(?:[abcdef]|gh)t/;
/(?:a|b|c|d|e|f|gt)t/;
__END__
Here's the output from 5.005_03, reformatted for a little extra clarity:
/at/||/bt/||/ct/||/dt/||/et/||/ft/||/ght/;
compiling RE `at'
size 3 first at 1
1: EXACT <at>(3)
3: END(0)
anchored `at' at 0 (checking anchored isall) minlen 2
compiling RE `bt'
size 3 first at 1
1: EXACT <bt>(3)
3: END(0)
anchored `bt' at 0 (checking anchored isall) minlen 2
compiling RE `ct'
size 3 first at 1
1: EXACT <ct>(3)
3: END(0)
anchored `ct' at 0 (checking anchored isall) minlen 2
compiling RE `dt'
size 3 first at 1
1: EXACT <dt>(3)
3: END(0)
anchored `dt' at 0 (checking anchored isall) minlen 2
compiling RE `et'
size 3 first at 1
1: EXACT <et>(3)
3: END(0)
anchored `et' at 0 (checking anchored isall) minlen 2
compiling RE `ft'
size 3 first at 1
1: EXACT <ft>(3)
3: END(0)
anchored `ft' at 0 (checking anchored isall) minlen 2
compiling RE `ght'
size 4 first at 1
1: EXACT <ght>(4)
4: END(0)
anchored `ght' at 0 (checking anchored isall) minlen 3
/at|bt|ct|dt|et|ft|ght/;
compiling RE `at|bt|ct|dt|et|ft|ght'
size 23
1: BRANCH(4)
2: EXACT <at>(23)
4: BRANCH(7)
5: EXACT <bt>(23)
7: BRANCH(10)
8: EXACT <ct>(23)
10: BRANCH(13)
11: EXACT <dt>(23)
13: BRANCH(16)
14: EXACT <et>(23)
16: BRANCH(19)
17: EXACT <ft>(23)
19: BRANCH(23)
20: EXACT <ght>(23)
23: END(0)
minlen 2
/[abcdef]t|ght/;
compiling RE `[abcdef]t|ght'
size 18
1: BRANCH(14)
2: ANYOF(12)
12: EXACT <t>(18)
14: BRANCH(18)
15: EXACT <ght>(18)
18: END(0)
minlen 2
/(?:[abcdef]|gh)t/;
compiling RE `(?:[abcdef]|gh)t'
size 18
1: BRANCH(12)
2: ANYOF(16)
12: BRANCH(15)
13: EXACT <gh>(16)
15: TAIL(16)
16: EXACT <t>(18)
18: END(0)
minlen 2
/(?:a|b|c|d|e|f|gt)t/;
compiling RE `(?:a|b|c|d|e|f|gt)t'
size 25
1: BRANCH(4)
2: EXACT <a>(23)
4: BRANCH(7)
5: EXACT <b>(23)
7: BRANCH(10)
8: EXACT <c>(23)
10: BRANCH(13)
11: EXACT <d>(23)
13: BRANCH(16)
14: EXACT <e>(23)
16: BRANCH(19)
17: EXACT <f>(23)
19: BRANCH(22)
20: EXACT <gt>(23)
22: TAIL(23)
23: EXACT <t>(25)
25: END(0)
minlen 2
Of course, that's just how the regular expressions are compiled; to see how the regexes are matched -- which is where the optimizer really comes into play -- give the regexes something to match against and watch the output at runtime! | [reply] [d/l] [select] |
|
|
That DEBUGGING option is pretty powerful, thanks for the tip. That is proof
positive of what I have been suspecting, that Perl will
"compile" the regexps, but it doesn't perform any optimizations
in any intelligent sense. I was expecting a behaviour like
"lex", which takes an "expression" and returns a fairly
optimized state-machine which parses the defined tokens.
I am considering making an optimizer which will tweak a given
regexp into something more efficient, such that you could give
it something and it would polish it into something known to
Benchmark faster. Of course, discovering all the tricks would
be a bit of a chore, which is part of the challenge.
My idea is that this hypothetical function would take a
regexp as input and return the optimized, but functionally
equivalent, version of same:
Input Output
---------------------------------------
'x|xy' 'xy?'
'xx|xyx' 'xy?x'
'a|abc' 'a(bc)?'
'a|ab|abc' 'a(b(c)?)?'
'a1|a2|a3' 'a[123]'
The intention is to add optimization techniques as they
are discovered, plus an implementation strategy is devised.
Certainly, not every input regexp will be understood
to the degree that it can be optimized, and these will be
left undisturbed.
Already, I am finding that the "optimial" solution is
somewhat obscure in even simple cases:
/that|this|thinks|thirst|three/
-> /th(at|i(?:(?:nk)?s|rst)|ree/ ?
-> /th(?:at|is|inks|irst|ree)/ ?
-> /th(?:at|i(?:nk)?s|irst|ree)/ ?
Just food for thought. | [reply] [d/l] [select] |
|
|
| [reply] |
Re: Regexp Speed Concerns
by fundflow (Chaplain) on Jan 12, 2001 at 07:28 UTC
|
Just a note: if memory serves, regexp equivalence is not an
easy question in general and thats why the compiler is unlikely to give
the exact same output.
That is, it is hard to say whether two regexps are equivalent.
(is it in NP? someone can surely fill up on that) | [reply] |
|
|
| [reply] |
|
|
Here's an interesting development.
A related problem which is PSPACE-complete is the problem: Given
a regex R, is there any string that R will not match?
Now, it's not known for sure that PSPACE-complete problems are intractible, although
it's very likely. (And in particular, PSPACE-complete problems
are at least as hard as NP-complete problems.) But here's the
interesting part.
Normally, in computer science, regex notations only include literal symbols,
concatenation, |, *, and
parenthese for grouping.
If you extend thge regex notation to allow
{2} as well, the problem becomes provably intractible
and can be shown to require an exponential amount of space.
| [reply] |
Re: Regexp Speed Concerns
by Anonymous Monk on Jan 13, 2001 at 05:49 UTC
|
As far as your test results go, to paraphrase the Third Camel, short-circuit alternation can be faster than the corresponding regex when the optimizer determines that your search expression (or portions of your search expression) can simply be handed off to a Boyer-Moore search algorithm which does very fast string matching. Note the word "string". This is why the following:
/at/ || /bt/ || /ct/
is faster than
/at|bt|ct/
Just as in your pattern 0, the regex engine does not even have to become involoved - there is no backtracking involved, no backreferences, and no quantifiers, hence simple string searching can be done with the B-M algorithm, which, as stated, is very fast. In fact, the longer the pattern, the faster B-M is, on average, because it attempts to be somewhat intelligent and selectively skips sections of input where it determines that no match is possible, rather than trying and backtracking over every single character. This is relevant to the large expressions you mentioned - if they involve long constant strings, then you will likely see a significant speed-up by using short-circuit alternation as demonstrated in your pattern 0.
Of course, for increasingly complex patterns with multiple quantifiers, etc., you are better off with a carefully crafted regex.
Other optimizations:
-use //o or qr// to avoid recompilation if the pattern does not change
-As TChrist et all teach us, have your program build a string containing the expression and eval "$code" it
-create a nonbacktracking subpattern using (?>PATTERN). This is like the cluster without capture modifier you used in patterns 3 and 4 (?:PATTERN) except it suppresses backtracking on any quanitifiers or alternations inside the subpattern.
For more about the Boyer-Moore algorithm, see:
A Fast String Searching Algorithm, R.S. Boyer & J.S. Moore in Communcations of the Association for Computing Machinery, 20(10), 1977, pp762-772
The classic CS text on regular expression theory (see Ch. 3):
Compilers - Principles, Techniques, and Tools, by Aho, Sethi, and Ullman
A regular expression is pretty smart, but it's smart like a horse. It can get distracted if it sees to much at once. So sometimes you have to put blinders onto it.
-Programming Perl, 3rd Ed. | [reply] [d/l] [select] |
Re: Regexp Speed Concerns
by ColonelPanic (Friar) on Jan 12, 2001 at 19:31 UTC
|
Thanks for this informative dicussion. I just sped up one of my scripts' running times dramatically, just by changing from /thing1|thing2/ to /thing1/ || /thing2/. This made a huge difference because I was doing a fairly long match over a file of several hundred lines, on a CGI script that runs many times a day. Thanks once again, Perl Monks | [reply] [d/l] |