in reply to Perl regexp matching is slow??

I have various feelings about this paper. If it weren't a veiled attack on the existing corpus of regex-engine work, I think it would be absolutely fantastic. Read as an explanation of using NFAs and DFAs to implement Kleene algebras, it's great. Read as an argument for why you shouldn't use backtracking NFA engines, or from a software-engineering point of view, in my opinion it fails to impress.

A general-purpose regex engine like the one Perl requires has to be able to do a lot, and has to balance considerations ranging from the memory footprint of a compiled object, construction time, flexibility, a rich feature set, and the ability to accommodate huge character sets, to, most importantly of course, matching performance. And it turns out that while DFA engines have a very good worst-case match time, they don't actually have many other redeeming features. Construction can be extremely slow, the memory footprint vast; all kinds of trickery is involved in doing Unicode or capturing properly; and they aren't suitable for patterns with backreferences.

It turns out that for a general-purpose engine, most of the patterns people want to match have characteristics that let the search space requiring a "real" regex engine be efficiently pruned. For instance, /most/ patterns people want to match in real life contain a fixed substring, and the beginning and end of the match can be efficiently determined from the location of that substring in the string being matched. This means that specialized string-matching algorithms that are more efficient than DFAs (like Fast Boyer-Moore) can be used to find this substring, and only when it is found is the real engine used to verify that the match is valid. In many cases this results in much faster match times than a DFA can provide, even when the verifier is a backtracking NFA.
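
The prefilter idea can be sketched in a few lines. This is a toy model in Python, not Perl's actual optimiser (which uses FBM tables and far deeper analysis); the pattern, literal, and prefix bound below are invented for illustration:

```python
import re

# Hypothetical example pattern: it contains the mandatory literal
# "foo", and its variable-length prefix (\d{1,5}) is bounded, so we
# know how far back from a literal hit a match could start.
PATTERN = re.compile(r"\d{1,5}foo\d{1,5}")
LITERAL = "foo"
MAX_PREFIX = 5

def prefiltered_search(text):
    """Find the mandatory literal with a fast substring scan, and only
    then let the real regex engine verify around each hit."""
    start = 0
    while (i := text.find(LITERAL, start)) != -1:
        m = PATTERN.search(text, max(0, i - MAX_PREFIX))
        if m:
            return m
        start = i + 1  # literal hit didn't verify; keep scanning
    return None
```

When the literal is absent, the engine proper is never invoked at all, which is exactly the cheap-rejection behaviour a general-purpose engine wants.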

Along similar lines, a general-purpose engine rarely needs to deal with "degenerate patterns" of the type illustrated in the article, and many patterns that are degenerate in the current engine need not be. Auto-possessivization is an open optimisation that would go a long way towards reducing the number of patterns that result in degenerate matching, which is itself prevented from getting too bad by the superlinear cache. In a similar vein, an NFA can be made more efficient by judicious use of explicit atomic subpatterns and possessive quantifiers, both of which are a form of compiler hinting.
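
To make the "compiler hinting" point concrete, here is a minimal sketch in Python standing in for Perl. Stock Python re only gained real atomic groups and possessive quantifiers in 3.11, so the sketch uses the well-known lookahead emulation of an atomic group; the pattern itself is a textbook degenerate example, not one from the article:

```python
import re

# A textbook degenerate pattern: the nested quantifiers give a
# backtracking engine exponentially many ways to fail.
degenerate = re.compile(r"(x+x+)+y")

# The same pattern with the repetition made atomic.  A lookahead is
# atomic (the engine may not backtrack into it once it has succeeded),
# so (?=(P))\1 behaves like the atomic group (?>P): whole branches of
# the search space are pruned instead of being explored.
atomic = re.compile(r"(?=((x+x+)+))\1y")
```

Both agree wherever a match exists; on a string of 30 x's with no y, the atomic form fails after a polynomial amount of work, while the naive form would need on the order of 2**30 backtracking steps.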

So, on balance, it turns out that an NFA can perform quite competitively with a DFA, while at the same time being much more flexible and powerful. Sure, there are classes of patterns that an NFA will not handle well, but on the other hand there are patterns that a DFA cannot handle at all, which is less useful than handling them slowly. At the end of the day I feel the balance of factors is on the backtracking NFA's side. I think there is a reason there aren't many industrial-quality DFA engines out there: they simply aren't suited to the job.

I think the truth is that for the messy real-life problem of "pattern matching" no one approach on its own will suffice. It seems to me that a hybrid approach is called for, one that blends the benefits of the DFA with the benefits of the NFA. In fact this is the direction things have been going in the Perl engine: an NFA framework with DFA-like components. In Perl 5.10 we introduced a TRIE regop, which allows efficient matching of alternations starting with literal text, and with a bit more work the framework used there could be extended to handle DFA-style construction for degenerate patterns like the one in the article.
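
A toy model of the idea behind the TRIE regop, sketched in Python; the representation below is mine, not the engine's actual data structure:

```python
def build_trie(words):
    """Build a nested-dict trie over literal alternatives; the key
    None marks the end of a word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[None] = True
    return root

def trie_match_at(trie, text, pos):
    """Return the longest alternative matching at pos, or None.
    One character is inspected per step, however many alternatives
    share that prefix -- whereas naive alternation retries each
    branch from scratch."""
    node, best = trie, None
    for i in range(pos, len(text) + 1):
        if None in node:
            best = text[pos:i]
        if i == len(text) or text[i] not in node:
            break
        node = node[text[i]]
    return best
```

An alternation like /foo|for|form|bar/ is thus matched in a single left-to-right scan, which is the DFA-like component inside the NFA framework.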

Anyway, in Perl 5.10 you'll be able to use whatever regex plug-in you like, so perhaps we will see Russ Cox contribute an implementation to prove his case.

---
$world=~s/war/peace/g

Replies are listed 'Best First'.
Re^2: Perl regexp matching is slow??
by diotalevi (Canon) on Jan 30, 2007 at 16:37 UTC

    I've been thinking of the "atomic subpatterns" and possessive quantifiers as pruning operations. The entire search space may be exponential but entire branches can be eliminated if you just code your pruning operations into your expressions.

    ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊

      Exactly, they are hints to tell the matching engine not to bother backtracking. Hypothetically the optimiser should be able to determine them all automatically and you shouldn't need them, but in implementation terms it's a lot easier to let people do it themselves.

      Auto-possessivization is something that sure does need doing. If you consider the general case XqY, where X and Y are literals and q is a quantifier, you should be able to rewrite it as Xq+Y whenever X cannot overlap Y. I've not got round to it yet, though.
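
      The overlap condition is easy to sketch for the simplest case, where X and Y are single literal characters. This is a hypothetical illustration in Python, not anything from the perl source, and real auto-possessivization has to reason about character classes, not just literals; it emits the possessive form via the atomic lookahead trick so it runs on any Python version:

```python
import re

def auto_possessify(x, q, y):
    """Sketch of the X q Y rule for the simplest case: X and Y are
    single literal characters and q is '*' or '+'.  If X can never
    begin a match of Y (here simply x != y), then backtracking out
    of Xq can never help Y match, so the quantifier may safely be
    made possessive.  Emitted via the atomic lookahead trick
    (?=(Xq))\\1, which any Python version supports; Python 3.11+
    could emit the possessive quantifier q+ directly."""
    if x != y:
        return "(?=(%s%s))\\1%s" % (re.escape(x), q, re.escape(y))
    # X overlaps Y: possessivizing would change the language.
    return "%s%s%s" % (re.escape(x), q, re.escape(y))
```

      So /a*b/ becomes effectively /a*+b/, while /a*a/ is left alone because giving back characters is exactly what lets it match.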

      ---
      $world=~s/war/peace/g

Re^2: Perl regexp matching is slow??
by roboticus (Chancellor) on Jan 31, 2007 at 11:15 UTC
    Hmmm ... I read the paper and didn't see anything that resembled an attack. To me it felt more like:

    "Hey gang, there was a fork in regex technology XX years ago and the fork we're on now has some pathological cases. This other fork doesn't have those pathological cases, but has these other shortcomings. Perhaps we can take the creamy caramel center of the NFA and wrap them in the chocolaty goodness of DFA and get something much tastier?"

    I totally agree with your reasoning about why we don't need to go that route. On the other hand, since reading the paper, my gears have been a-grinding with attempts at finding a way to merge backtracking into a DFA without consuming insane amounts of storage. It's quite an entertaining mental exercise, even though I've gotten nowhere yet. (I don't actually expect to get anywhere, but it's still fun to try...)

    --roboticus

      Well, let's see: the title, and the lack of any kind of analysis of how a backtracking NFA can mitigate its weaknesses.

      For instance, the pattern used for the performance analysis could be optimised to /a{n,2n}/, and if written by a human it probably would be. The optimised construct would not exhibit the poor performance he graphed. In the section marked "Real world regular expressions" he mentions that an NFA or DFA would do the opposite (unroll the expression), but he doesn't discuss the impact of such a translation. Unrolling such a construct results in an unmanageably large construction: a DFA that matches a{1,30000} is going to need at minimum 30k states and transitions. That's big. The same thing would be represented in a backtracking NFA like Perl's in what, four regops? Let's be generous and say each regop is 10 words (it's more like 1 or 2), so we would need 40 words to represent that pattern in a backtracking NFA. A DFA or a Thompson NFA would need 30k states; even assuming each transition is only a word, that is still a /huge/ difference in memory footprint, and in construction time, since memory for 30k nodes would have to be allocated, and so on. My bet is that before either a DFA or a Thompson NFA had finished construction, the Perl backtracking NFA would already have finished the match.
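
      The equivalence of the two formulations is easy to demonstrate. This is a small sketch using Python's backtracking re engine as a stand-in for Perl's; the helper names are mine, not from the paper or from perl:

```python
import re

def degenerate(n):
    # The paper's pathological form: n optional a's, then n mandatory a's.
    return re.compile("a?" * n + "a" * n)

def counted(n):
    # The equivalent counted form a human (or an optimiser) would write.
    return re.compile("a{%d,%d}" % (n, 2 * n))
```

      The two recognise the same language, but the counted form stays a handful of opcodes and matches quickly at any n, while unrolling it, as a DFA construction must, needs storage proportional to n.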

      Another thing: the benchmarks showed only one side of things, the performance of an accepting match. How fast will the Thompson construction be when the match fails? Perl's regex engine will reject the match in fewer than N character comparisons, because it will use FBM matching to locate the a{n} segment. It's only when the pattern matches that it behaves poorly, and that can be mitigated by using a better construct, which hypothetically could be produced automatically at the engine level.

      So it's as if the paper takes the tone "this is much better than that" and then glosses over all the hard stuff: it doesn't mention the negative trade-offs of the "better" solution, gives no credit to the existing engines and the solutions they use, and makes no admission that there are optimisation techniques for working around the potential degenerate cases.

      Update: tweaked some wording that wasnt clear

      ---
      $world=~s/war/peace/g

        In my humble experience as a has-been article writer, editors often stretch the wording of article titles to increase the apparent controversy. :(

        I totally agree with you that the author was less than balanced in his coverage, but, again, I must admit that I've fallen prey to the same lack of diligence.

        Thanks for your clarity in posting these numerous insights and details of matching for us, demerphq. I don't claim to be an expert, but your analysis will help me understand the science of matching a little better. :D

        Don Wilde
        "There's more than one level to any answer."
        OK ... that makes sense. I didn't really pay any attention to the title, and I tend to give writers the benefit of the doubt...

        But I'm still enjoying my attempts to make a better candy bar! 8^)

        --roboticus