Regexp Speed Concerns

tadman has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
(Ovid) Re: Regexp Speed Concerns by Ovid (Cardinal) on Jan 12, 2001 at 10:42 UTC
As has been commented previously, you should read Mastering Regular Expressions. It is the definitive work on regexes and has no peer. Unfortunately, it is rather out of date, but the author, Jeffrey Friedl is writing a third edition which I understand should be out shortly. If you are familiar with regular expressions, don't make the mistake of thinking that all regular expression engines are the same. They fall into three general categories: DFA. Fast, but does not allow back references (e.g. capture to $1). I understand that someone is writing a DFA engine that can do backreferences, but I'm too lazy to look up the reference right now :) Traditional NFA. This is the regex engine that Perl uses and, amongst "those in the know", is often considered the best. It's fairly fast and allows backreferences. POSIX NFA. Every once in a while, a standards body comes out with a poor standard. This is one of them. POSIX NFA usually tries for the "longest" match and, as such, can take an incredible amount of time with some regular expressions. Traditional NFA usually goes for the first match, thus allowing greater speed. Between POSIX and traditional NFA engines, both can take long time verifying a failed match (sometimes longer than the than the length of time the universe has existed!), but traditional NFA tends to be much faster on successful matches. Optimizing regular expressions, however, takes a lot of work and is far beyond the scope of this post. One thing you can do is read my shameless plug. However, the issues raised in that post aren't terribly relevant to what you're doing here. The issue you are having is the following: when dealing with choices for single character, use a character class, but *don't* alternate!!! The following seem equivalent, but they're not: `$string =~ /[abc]/; # character class $string =~ /a\|b\|c/; # alternation` [download] It is almost guaranteed that the first will run faster. The reason for this is simple: when dealing with a character class, Perl's regex engine kind of creates a "sieve" that the characters pass through. If it doesn't go through the "abc" sieve, it doesn't pass. When you alternate, as in the second example, Perl's regex engine attempts to match the 'a'. If that fails, backs up to the end of the last successful match and tries the 'b'. If that fails, it again notes the end of the last successful match and tries 'c'. In short, it's forced to mark its position and continuously retry the match, but the character class simply allows 'a', 'b', or 'c' to pass. The alternation problem is worsened when dealing with multiple characters. `$string =~ /bob/ \|\| /alice/; # Good $string =~ /(?:bob\|alice)/; # Bad` [download] The first example will simply try to match the target strings. The second example will keep trying to match, move forward a character after the failure, try to match again, etc, etc (no, this isn't quite accurate, but I'm trying to be brief). This is almost guaranteed to be slower. However, these speed issues are more of a concern when iterating over data. If it's a single shot regex, it's usually not that big of a consideration. Regardless of all of the drivel here, it's important to take into consideration the actual data to be used. Testing against hypothetical data that really doesn't match what you are going to use is a poor test of the efficiency of your solution (this is just an aside. I have no idea if it's directly applicable to what you're doing). However, if you create sample data that closely mimics what you want, benchmarking your data is much more likely to produce relevant results. Cheers, Ovid Join the Perlmonks Setiathome Group or just click on the the link and check out our stats.	[reply] [d/l] [select]
Re: Regexp Speed Concerns by tilly (Archbishop) on Jan 12, 2001 at 08:02 UTC
As you might expect this has come up before. The classic book to read is Mastering Regular Expressions. For threads that have touched on this see RE (tilly) 4: SAS log scanner and Numeric list to optimised regexp. And for an interesting discussion see also this. Unfortunately I don't think anyone will try to actually implement that in the near future, but you may find it interesting to read.	[reply]
Re: Regexp Speed Concerns by extremely (Priest) on Jan 12, 2001 at 07:40 UTC
You are sure going to be happy to find out there is an entire book on this subject. =) Mastering Regular Expressions is a fine, fine book. Highly recommended. -- $you = new YOU; honk() if $you->love(perl)	[reply]
Re: Regexp Speed Concerns by Hrunting (Pilgrim) on Jan 12, 2001 at 07:50 UTC
I've never looked at the code that perl uses to handle regexes, both compiling and optimizing, but from just looking at it and talking through it, I can begin to see why you get different results: `1: look for at, not found, look for bt, not found, look for ct ... 2: look for [abcdef], found, look for t, not found, look for ght 3: open grouping, look for [abcdef], not found, look for gh, close gro +uping, found, look for t 4: open grouping, look for a, not found, look for b, not found, look f +or c, ... look for gh, close grouping, found, look for t.` [download] Just from speaking it out I can see that 4 looks like the slowest and 2 looks faster than 3. I too was puzzled by 0 until I look at it from the view that the compiler probably doesn't compile the /bt/ regex unless /at/ fails, so if the word has an 'at' in it, none of the other regexes will fail, and even compiling two or three of those regexes may be faster than compiling and processing a much larger, more complex one. That has more to do with the rules of how the boolean operators work rather than how regexes are optimized. But again, I have absolutely no clue how it's done under the hood. That's just me looking at it and talking to myself.	[reply] [d/l]
Re: Regexp Speed Concerns by chipmunk (Parson) on Jan 12, 2001 at 20:44 UTC
Want to see how a regular expression is compiled? Perl can tell you! perldebug: `Debugging regular expressions There are two ways to enable debugging output for regular expressions. If your perl is compiled with -DDEBUGGING, you may use the -Dr flag on the command line. Otherwise, one can use re 'debug', which has effects both at compile time, and at run time (and is not lexically scoped). [snip; read the details in the perldebug documentation.]` [download] The really nice part is that, unlike most of the low-level debugging features, you don't need a DEBUGGING build of Perl to debug your regular expressions. `#!perl -c use re 'debug'; /at/\|\|/bt/\|\|/ct/\|\|/dt/\|\|/et/\|\|/ft/\|\|/ght/; /at\|bt\|ct\|dt\|et\|ft\|ght/; /[abcdef]t\|ght/; /(?:[abcdef]\|gh)t/; /(?:a\|b\|c\|d\|e\|f\|gt)t/; __END__` [download] Here's the output from 5.005_03, reformatted for a little extra clarity: /at/\|\|/bt/\|\|/ct/\|\|/dt/\|\|/et/\|\|/ft/\|\|/ght/; compiling RE `at' size 3 first at 1 1: EXACT <at>(3) 3: END(0) anchored `at' at 0 (checking anchored isall) minlen 2 compiling RE `bt' size 3 first at 1 1: EXACT <bt>(3) 3: END(0) anchored `bt' at 0 (checking anchored isall) minlen 2 compiling RE `ct' size 3 first at 1 1: EXACT <ct>(3) 3: END(0) anchored `ct' at 0 (checking anchored isall) minlen 2 compiling RE `dt' size 3 first at 1 1: EXACT <dt>(3) 3: END(0) anchored `dt' at 0 (checking anchored isall) minlen 2 compiling RE `et' size 3 first at 1 1: EXACT <et>(3) 3: END(0) anchored `et' at 0 (checking anchored isall) minlen 2 compiling RE `ft' size 3 first at 1 1: EXACT <ft>(3) 3: END(0) anchored `ft' at 0 (checking anchored isall) minlen 2 compiling RE `ght' size 4 first at 1 1: EXACT <ght>(4) 4: END(0) anchored `ght' at 0 (checking anchored isall) minlen 3 /at\|bt\|ct\|dt\|et\|ft\|ght/; compiling RE `at\|bt\|ct\|dt\|et\|ft\|ght' size 23 1: BRANCH(4) 2: EXACT <at>(23) 4: BRANCH(7) 5: EXACT <bt>(23) 7: BRANCH(10) 8: EXACT <ct>(23) 10: BRANCH(13) 11: EXACT <dt>(23) 13: BRANCH(16) 14: EXACT <et>(23) 16: BRANCH(19) 17: EXACT <ft>(23) 19: BRANCH(23) 20: EXACT <ght>(23) 23: END(0) minlen 2 /[abcdef]t\|ght/; compiling RE `[abcdef]t\|ght' size 18 1: BRANCH(14) 2: ANYOF(12) 12: EXACT <t>(18) 14: BRANCH(18) 15: EXACT <ght>(18) 18: END(0) minlen 2 /(?:[abcdef]\|gh)t/; compiling RE `(?:[abcdef]\|gh)t' size 18 1: BRANCH(12) 2: ANYOF(16) 12: BRANCH(15) 13: EXACT <gh>(16) 15: TAIL(16) 16: EXACT <t>(18) 18: END(0) minlen 2 /(?:a\|b\|c\|d\|e\|f\|gt)t/; compiling RE `(?:a\|b\|c\|d\|e\|f\|gt)t' size 25 1: BRANCH(4) 2: EXACT <a>(23) 4: BRANCH(7) 5: EXACT <b>(23) 7: BRANCH(10) 8: EXACT <c>(23) 10: BRANCH(13) 11: EXACT <d>(23) 13: BRANCH(16) 14: EXACT <e>(23) 16: BRANCH(19) 17: EXACT <f>(23) 19: BRANCH(22) 20: EXACT <gt>(23) 22: TAIL(23) 23: EXACT <t>(25) 25: END(0) minlen 2 [download] Of course, that's just how the regular expressions are compiled; to see how the regexes are matched -- which is where the optimizer really comes into play -- give the regexes something to match against and watch the output at runtime!	[reply] [d/l] [select]
Re^2: Regexp Speed Concerns by tadman (Prior) on Jan 13, 2001 at 03:31 UTC
That DEBUGGING option is pretty powerful, thanks for the tip. That is proof positive of what I have been suspecting, that Perl will "compile" the regexps, but it doesn't perform any optimizations in any intelligent sense. I was expecting a behaviour like "lex", which takes an "expression" and returns a fairly optimized state-machine which parses the defined tokens. I am considering making an optimizer which will tweak a given regexp into something more efficient, such that you could give it something and it would polish it into something known to Benchmark faster. Of course, discovering all the tricks would be a bit of a chore, which is part of the challenge. My idea is that this hypothetical function would take a regexp as input and return the optimized, but functionally equivalent, version of same: `Input Output --------------------------------------- 'x\|xy' 'xy?' 'xx\|xyx' 'xy?x' 'a\|abc' 'a(bc)?' 'a\|ab\|abc' 'a(b(c)?)?' 'a1\|a2\|a3' 'a[123]'` [download] The intention is to add optimization techniques as they are discovered, plus an implementation strategy is devised. Certainly, not every input regexp will be understood to the degree that it can be optimized, and these will be left undisturbed. Already, I am finding that the "optimial" solution is somewhat obscure in even simple cases: `/that\|this\|thinks\|thirst\|three/ -> /th(at\|i(?:(?:nk)?s\|rst)\|ree/ ? -> /th(?:at\|is\|inks\|irst\|ree)/ ? -> /th(?:at\|i(?:nk)?s\|irst\|ree)/ ?` [download] Just food for thought.	[reply] [d/l] [select]
Re (tilly) 3: Regexp Speed Concerns by tilly (Archbishop) on Jan 13, 2001 at 04:24 UTC
Did you miss Re: Regexp Speed Concerns? I pointed there to two links here where there is code for automatic optimization of REs. The third link is to a general purpose optimization which would need to be done within Perl's engine, but which could lead to very good things. As for why Perl doesn't do what lex does, the finite state engine that lex produces is a DFA, and so unable to handle backreferences or tell the difference between .* and .*?.	[reply]
Re: Regexp Speed Concerns by fundflow (Chaplain) on Jan 12, 2001 at 07:28 UTC
Just a note: if memory serves, regexp equivalence is not an easy question in general and thats why the compiler is unlikely to give the exact same output. That is, it is hard to say whether two regexps are equivalent. (is it in NP? someone can surely fill up on that)	[reply]
Re: Regexp Speed Concerns by Dominus (Parson) on Jan 12, 2001 at 20:45 UTC
Says fundflow: if memory serves, regexp equivalence is not an easy question You are absolutely right. Very little is known about regexes. What little is known indicates that the problems are all very hard. (is it in NP? someone can surely fill up on that) It is probably not in NP, but nobody knows for sure. Regex inequivalence ("Given two regexes, do they represent different languages?") is known to be PSPACE-complete. PSPACE-complete problems are at least has hard as NP-complete problems, but probably harder. If the regexes contain only the concatenation and `\|` operators (no ``'s), the problem is "only" NP-complete. See (of course) Garey & Johnson Computers and Intractibility*, p. 267.	[reply]
Re: Regexp Speed Concerns by Dominus (Parson) on Jan 13, 2001 at 07:09 UTC
Here's an interesting development. A related problem which is PSPACE-complete is the problem: Given a regex R, is there any string that R will not match? Now, it's not known for sure that PSPACE-complete problems are intractible, although it's very likely. (And in particular, PSPACE-complete problems are at least as hard as NP-complete problems.) But here's the interesting part. Normally, in computer science, regex notations only include literal symbols, concatenation, `\|`, ``, and parenthese for grouping. If you extend thge regex notation to allow `{2}` as well, the problem becomes provably intractible* and can be shown to require an exponential amount of space.	[reply]
Re: Regexp Speed Concerns by Anonymous Monk on Jan 13, 2001 at 05:49 UTC
As far as your test results go, to paraphrase the Third Camel, short-circuit alternation can be faster than the corresponding regex when the optimizer determines that your search expression (or portions of your search expression) can simply be handed off to a Boyer-Moore search algorithm which does very fast string matching. Note the word "string". This is why the following: `/at/ \|\| /bt/ \|\| /ct/` is faster than `/at\|bt\|ct/` Just as in your pattern 0, the regex engine does not even have to become involoved - there is no backtracking involved, no backreferences, and no quantifiers, hence simple string searching can be done with the B-M algorithm, which, as stated, is very fast. In fact, the longer the pattern, the faster B-M is, on average, because it attempts to be somewhat intelligent and selectively skips sections of input where it determines that no match is possible, rather than trying and backtracking over every single character. This is relevant to the large expressions you mentioned - if they involve long constant strings, then you will likely see a significant speed-up by using short-circuit alternation as demonstrated in your pattern 0. Of course, for increasingly complex patterns with multiple quantifiers, etc., you are better off with a carefully crafted regex. Other optimizations: -use //o or qr// to avoid recompilation if the pattern does not change -As TChrist et all teach us, have your program build a string containing the expression and eval "$code" it -create a nonbacktracking subpattern using (?>PATTERN). This is like the cluster without capture modifier you used in patterns 3 and 4 (?:PATTERN) except it suppresses backtracking on any quanitifiers or alternations inside the subpattern. For more about the Boyer-Moore algorithm, see: A Fast String Searching Algorithm, R.S. Boyer & J.S. Moore in Communcations of the Association for Computing Machinery, 20(10), 1977, pp762-772 The classic CS text on regular expression theory (see Ch. 3): Compilers - Principles, Techniques, and Tools, by Aho, Sethi, and Ullman A regular expression is pretty smart, but it's smart like a horse. It can get distracted if it sees to much at once. So sometimes you have to put blinders onto it. -Programming Perl, 3rd Ed.	[reply] [d/l] [select]
Re: Regexp Speed Concerns by ColonelPanic (Friar) on Jan 12, 2001 at 19:31 UTC
Thanks for this informative dicussion. I just sped up one of my scripts' running times dramatically, just by changing from `/thing1\|thing2/ to /thing1/ \|\| /thing2/`. This made a huge difference because I was doing a fairly long match over a file of several hundred lines, on a CGI script that runs many times a day. Thanks once again, Perl Monks	[reply] [d/l]