tadman has asked for the wisdom of the Perl Monks concerning the following question:

The regexp compiler in Perl is an incredible thing, but it is not always clear how it works its magic. My concern revolves around large regexps, and how they can be optimized to improve speed, something that I'm sure comes up a lot.

I was wondering, should I optimize my regexps, or will Perl do that for me anyway when it compiles which would make my effort wasted?

So I composed a quick test to find out.

Which one of the following functionally equivalent statements is the fastest way to find pattern matches?
0: /at/||/bt/||/ct/||/dt/||/et/||/ft/||/ght/ 1: /at|bt|ct|dt|et|ft|ght/ 2: /[abcdef]t|ght/ 3: /(?:[abcdef]|gh)t/ 4: /(?:a|b|c|d|e|f|gt)t/
I was expecting that Perl would compile 1..4 to exactly the same thing, but Benchmark shows that is not the case. Internally, there must be substantial differences in the way they are implemented.

The test I composed used data from /usr/dict/words in two ways. One was to test on many small bits of data, such as the words, and the other was to test on the entire file.

On small data (@test = <DICT>) the winner is 0, which I found surprising. However, it is only 15% faster than 2 and 3, which were tied for second (not unexpectedly). Then came 1, which was only slightly faster than 4, and the two of them are 45%+ slower than the others. The regexp 'or' operator sure slows things down, it would seem.

On the large dataset (@test = join (',',<DICT>)), 0 was too cumbersome to be implemented. The rest came up in the same order, 2,3,1,4, as before. No surprise there.

I'm not sure if I need a more ambitious test, but I think it is clear that the compiler doesn't do everything for you by any degree, and that the programmer certainly has to make an effort to construct the most efficient regexp.

I'm curious, though, why patterns 1..4 aren't compiled the same.

Replies are listed 'Best First'.
(Ovid) Re: Regexp Speed Concerns
by Ovid (Cardinal) on Jan 12, 2001 at 10:42 UTC
    As has been commented previously, you should read Mastering Regular Expressions. It is the definitive work on regexes and has no peer. Unfortunately, it is rather out of date, but the author, Jeffrey Friedl is writing a third edition which I understand should be out shortly.

    If you are familiar with regular expressions, don't make the mistake of thinking that all regular expression engines are the same. They fall into three general categories:

    1. DFA. Fast, but does not allow back references (e.g. capture to $1). I understand that someone is writing a DFA engine that can do backreferences, but I'm too lazy to look up the reference right now :)
    2. Traditional NFA. This is the regex engine that Perl uses and, amongst "those in the know", is often considered the best. It's fairly fast and allows backreferences.
    3. POSIX NFA. Every once in a while, a standards body comes out with a poor standard. This is one of them. POSIX NFA usually tries for the "longest" match and, as such, can take an incredible amount of time with some regular expressions. Traditional NFA usually goes for the first match, thus allowing greater speed.

    Between POSIX and traditional NFA engines, both can take long time verifying a failed match (sometimes longer than the than the length of time the universe has existed!), but traditional NFA tends to be much faster on successful matches.

    Optimizing regular expressions, however, takes a lot of work and is far beyond the scope of this post. One thing you can do is read my shameless plug. However, the issues raised in that post aren't terribly relevant to what you're doing here.

    The issue you are having is the following: when dealing with choices for single character, use a character class, but don't alternate!!! The following seem equivalent, but they're not:

    $string =~ /[abc]/; # character class $string =~ /a|b|c/; # alternation
    It is almost guaranteed that the first will run faster. The reason for this is simple: when dealing with a character class, Perl's regex engine kind of creates a "sieve" that the characters pass through. If it doesn't go through the "abc" sieve, it doesn't pass. When you alternate, as in the second example, Perl's regex engine attempts to match the 'a'. If that fails, backs up to the end of the last successful match and tries the 'b'. If that fails, it again notes the end of the last successful match and tries 'c'. In short, it's forced to mark its position and continuously retry the match, but the character class simply allows 'a', 'b', or 'c' to pass.

    The alternation problem is worsened when dealing with multiple characters.

    $string =~ /bob/ || /alice/; # Good $string =~ /(?:bob|alice)/; # Bad
    The first example will simply try to match the target strings. The second example will keep trying to match, move forward a character after the failure, try to match again, etc, etc (no, this isn't quite accurate, but I'm trying to be brief). This is almost guaranteed to be slower. However, these speed issues are more of a concern when iterating over data. If it's a single shot regex, it's usually not that big of a consideration.

    Regardless of all of the drivel here, it's important to take into consideration the actual data to be used. Testing against hypothetical data that really doesn't match what you are going to use is a poor test of the efficiency of your solution (this is just an aside. I have no idea if it's directly applicable to what you're doing). However, if you create sample data that closely mimics what you want, benchmarking your data is much more likely to produce relevant results.

    Cheers,
    Ovid

    Join the Perlmonks Setiathome Group or just click on the the link and check out our stats.

Re: Regexp Speed Concerns
by tilly (Archbishop) on Jan 12, 2001 at 08:02 UTC
Re: Regexp Speed Concerns
by extremely (Priest) on Jan 12, 2001 at 07:40 UTC
    You are sure going to be happy to find out there is an entire book on this subject. =) Mastering Regular Expressions is a fine, fine book. Highly recommended.

    --
    $you = new YOU;
    honk() if $you->love(perl)

Re: Regexp Speed Concerns
by Hrunting (Pilgrim) on Jan 12, 2001 at 07:50 UTC
    I've never looked at the code that perl uses to handle regexes, both compiling and optimizing, but from just looking at it and talking through it, I can begin to see why you get different results:

    1: look for at, not found, look for bt, not found, look for ct ... 2: look for [abcdef], found, look for t, not found, look for ght 3: open grouping, look for [abcdef], not found, look for gh, close gro +uping, found, look for t 4: open grouping, look for a, not found, look for b, not found, look f +or c, ... look for gh, close grouping, found, look for t.
    Just from speaking it out I can see that 4 looks like the slowest and 2 looks faster than 3. I too was puzzled by 0 until I look at it from the view that the compiler probably doesn't compile the /bt/ regex unless /at/ fails, so if the word has an 'at' in it, none of the other regexes will fail, and even compiling two or three of those regexes may be faster than compiling and processing a much larger, more complex one. That has more to do with the rules of how the boolean operators work rather than how regexes are optimized. But again, I have absolutely no clue how it's done under the hood. That's just me looking at it and talking to myself.
Re: Regexp Speed Concerns
by chipmunk (Parson) on Jan 12, 2001 at 20:44 UTC
    Want to see how a regular expression is compiled? Perl can tell you!

    perldebug:

    Debugging regular expressions There are two ways to enable debugging output for regular expressions. If your perl is compiled with -DDEBUGGING, you may use the -Dr flag on the command line. Otherwise, one can use re 'debug', which has effects both at compile time, and at run time (and is not lexically scoped). [snip; read the details in the perldebug documentation.]
    The really nice part is that, unlike most of the low-level debugging features, you don't need a DEBUGGING build of Perl to debug your regular expressions.
    #!perl -c use re 'debug'; /at/||/bt/||/ct/||/dt/||/et/||/ft/||/ght/; /at|bt|ct|dt|et|ft|ght/; /[abcdef]t|ght/; /(?:[abcdef]|gh)t/; /(?:a|b|c|d|e|f|gt)t/; __END__
    Here's the output from 5.005_03, reformatted for a little extra clarity:
    /at/||/bt/||/ct/||/dt/||/et/||/ft/||/ght/; compiling RE `at' size 3 first at 1 1: EXACT <at>(3) 3: END(0) anchored `at' at 0 (checking anchored isall) minlen 2 compiling RE `bt' size 3 first at 1 1: EXACT <bt>(3) 3: END(0) anchored `bt' at 0 (checking anchored isall) minlen 2 compiling RE `ct' size 3 first at 1 1: EXACT <ct>(3) 3: END(0) anchored `ct' at 0 (checking anchored isall) minlen 2 compiling RE `dt' size 3 first at 1 1: EXACT <dt>(3) 3: END(0) anchored `dt' at 0 (checking anchored isall) minlen 2 compiling RE `et' size 3 first at 1 1: EXACT <et>(3) 3: END(0) anchored `et' at 0 (checking anchored isall) minlen 2 compiling RE `ft' size 3 first at 1 1: EXACT <ft>(3) 3: END(0) anchored `ft' at 0 (checking anchored isall) minlen 2 compiling RE `ght' size 4 first at 1 1: EXACT <ght>(4) 4: END(0) anchored `ght' at 0 (checking anchored isall) minlen 3 /at|bt|ct|dt|et|ft|ght/; compiling RE `at|bt|ct|dt|et|ft|ght' size 23 1: BRANCH(4) 2: EXACT <at>(23) 4: BRANCH(7) 5: EXACT <bt>(23) 7: BRANCH(10) 8: EXACT <ct>(23) 10: BRANCH(13) 11: EXACT <dt>(23) 13: BRANCH(16) 14: EXACT <et>(23) 16: BRANCH(19) 17: EXACT <ft>(23) 19: BRANCH(23) 20: EXACT <ght>(23) 23: END(0) minlen 2 /[abcdef]t|ght/; compiling RE `[abcdef]t|ght' size 18 1: BRANCH(14) 2: ANYOF(12) 12: EXACT <t>(18) 14: BRANCH(18) 15: EXACT <ght>(18) 18: END(0) minlen 2 /(?:[abcdef]|gh)t/; compiling RE `(?:[abcdef]|gh)t' size 18 1: BRANCH(12) 2: ANYOF(16) 12: BRANCH(15) 13: EXACT <gh>(16) 15: TAIL(16) 16: EXACT <t>(18) 18: END(0) minlen 2 /(?:a|b|c|d|e|f|gt)t/; compiling RE `(?:a|b|c|d|e|f|gt)t' size 25 1: BRANCH(4) 2: EXACT <a>(23) 4: BRANCH(7) 5: EXACT <b>(23) 7: BRANCH(10) 8: EXACT <c>(23) 10: BRANCH(13) 11: EXACT <d>(23) 13: BRANCH(16) 14: EXACT <e>(23) 16: BRANCH(19) 17: EXACT <f>(23) 19: BRANCH(22) 20: EXACT <gt>(23) 22: TAIL(23) 23: EXACT <t>(25) 25: END(0) minlen 2
    Of course, that's just how the regular expressions are compiled; to see how the regexes are matched -- which is where the optimizer really comes into play -- give the regexes something to match against and watch the output at runtime!
      That DEBUGGING option is pretty powerful, thanks for the tip. That is proof positive of what I have been suspecting, that Perl will "compile" the regexps, but it doesn't perform any optimizations in any intelligent sense. I was expecting a behaviour like "lex", which takes an "expression" and returns a fairly optimized state-machine which parses the defined tokens.

      I am considering making an optimizer which will tweak a given regexp into something more efficient, such that you could give it something and it would polish it into something known to Benchmark faster. Of course, discovering all the tricks would be a bit of a chore, which is part of the challenge.

      My idea is that this hypothetical function would take a regexp as input and return the optimized, but functionally equivalent, version of same:
      Input Output --------------------------------------- 'x|xy' 'xy?' 'xx|xyx' 'xy?x' 'a|abc' 'a(bc)?' 'a|ab|abc' 'a(b(c)?)?' 'a1|a2|a3' 'a[123]'
      The intention is to add optimization techniques as they are discovered, plus an implementation strategy is devised. Certainly, not every input regexp will be understood to the degree that it can be optimized, and these will be left undisturbed.

      Already, I am finding that the "optimial" solution is somewhat obscure in even simple cases:
      /that|this|thinks|thirst|three/ -> /th(at|i(?:(?:nk)?s|rst)|ree/ ? -> /th(?:at|is|inks|irst|ree)/ ? -> /th(?:at|i(?:nk)?s|irst|ree)/ ?

      Just food for thought.
        Did you miss Re: Regexp Speed Concerns? I pointed there to two links here where there is code for automatic optimization of REs. The third link is to a general purpose optimization which would need to be done within Perl's engine, but which could lead to very good things.

        As for why Perl doesn't do what lex does, the finite state engine that lex produces is a DFA, and so unable to handle backreferences or tell the difference between .* and .*?.

Re: Regexp Speed Concerns
by fundflow (Chaplain) on Jan 12, 2001 at 07:28 UTC
    Just a note: if memory serves, regexp equivalence is not an easy question in general and thats why the compiler is unlikely to give the exact same output.

    That is, it is hard to say whether two regexps are equivalent.

    (is it in NP? someone can surely fill up on that)

      Says fundflow:
      if memory serves, regexp equivalence is not an easy question
      You are absolutely right. Very little is known about regexes. What little is known indicates that the problems are all very hard.

      (is it in NP? someone can surely fill up on that)
      It is probably not in NP, but nobody knows for sure. Regex inequivalence ("Given two regexes, do they represent different languages?") is known to be PSPACE-complete. PSPACE-complete problems are at least has hard as NP-complete problems, but probably harder. If the regexes contain only the concatenation and | operators (no *'s), the problem is "only" NP-complete.

      See (of course) Garey & Johnson Computers and Intractibility, p. 267.

        Here's an interesting development.

        A related problem which is PSPACE-complete is the problem: Given a regex R, is there any string that R will not match? Now, it's not known for sure that PSPACE-complete problems are intractible, although it's very likely. (And in particular, PSPACE-complete problems are at least as hard as NP-complete problems.) But here's the interesting part. Normally, in computer science, regex notations only include literal symbols, concatenation, |, *, and parenthese for grouping. If you extend thge regex notation to allow {2} as well, the problem becomes provably intractible and can be shown to require an exponential amount of space.

Re: Regexp Speed Concerns
by Anonymous Monk on Jan 13, 2001 at 05:49 UTC
    As far as your test results go, to paraphrase the Third Camel, short-circuit alternation can be faster than the corresponding regex when the optimizer determines that your search expression (or portions of your search expression) can simply be handed off to a Boyer-Moore search algorithm which does very fast string matching. Note the word "string". This is why the following:

    /at/ || /bt/ || /ct/

    is faster than

       /at|bt|ct/

    Just as in your pattern 0, the regex engine does not even have to become involoved - there is no backtracking involved, no backreferences, and no quantifiers, hence simple string searching can be done with the B-M algorithm, which, as stated, is very fast. In fact, the longer the pattern, the faster B-M is, on average, because it attempts to be somewhat intelligent and selectively skips sections of input where it determines that no match is possible, rather than trying and backtracking over every single character. This is relevant to the large expressions you mentioned - if they involve long constant strings, then you will likely see a significant speed-up by using short-circuit alternation as demonstrated in your pattern 0.

    Of course, for increasingly complex patterns with multiple quantifiers, etc., you are better off with a carefully crafted regex.

    Other optimizations:
    -use //o or qr// to avoid recompilation if the pattern does not change
    -As TChrist et all teach us, have your program build a string containing the expression and eval "$code" it
    -create a nonbacktracking subpattern using (?>PATTERN). This is like the cluster without capture modifier you used in patterns 3 and 4 (?:PATTERN) except it suppresses backtracking on any quanitifiers or alternations inside the subpattern.

    For more about the Boyer-Moore algorithm, see:
    A Fast String Searching Algorithm, R.S. Boyer & J.S. Moore in Communcations of the Association for Computing Machinery, 20(10), 1977, pp762-772

    The classic CS text on regular expression theory (see Ch. 3):
    Compilers - Principles, Techniques, and Tools, by Aho, Sethi, and Ullman

    A regular expression is pretty smart, but it's smart like a horse. It can get distracted if it sees to much at once. So sometimes you have to put blinders onto it.
    -Programming Perl, 3rd Ed.
Re: Regexp Speed Concerns
by ColonelPanic (Friar) on Jan 12, 2001 at 19:31 UTC
    Thanks for this informative dicussion. I just sped up one of my scripts' running times dramatically, just by changing from /thing1|thing2/ to /thing1/ || /thing2/. This made a huge difference because I was doing a fairly long match over a file of several hundred lines, on a CGI script that runs many times a day. Thanks once again, Perl Monks