I have various feelings about this paper. I think if it weren't a veiled attack on the existing corpus of regex engine work it would be absolutely fantastic. If you read it as an explanation of using NFA and DFA's to implement kleene algerbras then its great. If you read it as an argument for why you shouldnt use backtracking NFA engines or from the point of view of software engineering in my opinion it fails to impress.

A general purpose regex engine like that required for perl has to be able to do a lot, and has to balance considerations ranging from memory footprint of a compiled object, construction time, flexibility, rich feature-sets, the ability to accomodate huge character sets, and of course most importantly matching performance. And it turns out that while DFA engines have a very good worst case match time, they dont actually have too many other redeeming features. Construction can be extremely slow, the memory footprint vast, all kinds of trickery is involved to do unicode or capturing properly and they aren't suitable for patterns with backreferences.

It turns out that for a general purpose engine, most of the patterns that people want to match have characteristics that mean that the search space that requires a "real" regex engine can be efficiently pruned. For instance, /most/ patterns people want to match in real life have a fixed substring in them and the beginning and end of the match can be efficiently determined from the location of that substring in the string being matched. This means that specialized string matching algorithms that are more efficient than DFA's (like Fast Boyer Moore) can be utilized to find this substring and only when it is found is the real engine used to verify that the match is valid. In many cases this can result in much faster match times than a DFA can provide, even when the verifier is an NFA with backtracking.

On a similar line a general purpose engine rarely needs to deal with "degenerate patterns" of the type illustrated in the article and many patterns that are degenerate in the current engine need not be. Auto-possessivization is an open optimisation that would go a long way to reducing the number of patterns that would result in degenerate matching, which is itself blocked from being too bad by the superlinear cache. In a similar vein an NFA can be made more efficient by judicious use of explicit atomic subpatterns and possessive quantifiers, both of which are a form of compiler hinting.

So, on the balance it turns out that an NFA can perform quite competitively with a DFA, while at the same time being much more flexible and powerful. Sure there will be classes of patterns that NFA will not do well, but on the other hand there are patterns that a DFA cannot do at all which is less useful than doing them slowly. At then end of the day I feel the balance of factors is on the backtracking NFA's side. I think there is a reason why there aren't too many industrial quality DFA engines out there, they simply arent suited for the job.

I think the truth is that for the messy real life problem of "pattern matching" no one approach on its own will suffice. It seems to me that a hybrid approach is called for. One that blends the benefits of the DFA with the benefits of the NFA. In fact this is the direction that things have been going in the Perl engine. An NFA framework with DFA-like components. In Perl 5.10 we introduced a TRIE regop which allows for efficient matching of alternations starting with literal text and with a bit more work the framework used there could be extended to handle DFA type construction for degenerate patterns like the one in that article.

Anyway, In perl 5.10 youll be able to use whatever regex plug in you like, so perhaps we will see Russ Cox contribute an implementation to prove his case.

---
$world=~s/war/peace/g


In reply to Re: Perl regexp matching is slow?? by demerphq
in thread Perl regexp matching is slow?? by smahesh

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.