in reply to Re: Perl regexp matching is slow??
in thread Perl regexp matching is slow??

The big difference between these two is whether you think regular expressions should be treated as descriptive (here's the effect I want) or prescriptive (here's how I want you to do it).

Well, both are at play in the Perl engine. If it were fully prescriptive then the engine wouldn't be able to do things like use fixed-string optimisations.

Then it would be plenty safe to take a regexp from the user and use it to run a search. But with the current Perl regexes, no way.

Doesn't this criticism apply equally to Thompson's algorithm or to DFA construction? I would have thought the only difference would be that in a DFA you'd see performance issues with compilation and not execution.
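For concreteness, here is a toy sketch (in Python, and nothing like Perl's actual engine; the list-of-`(char, optional)` tokens pattern representation is purely an assumption for illustration) of why backtracking in particular makes user-supplied patterns risky: on the classic degenerate case a?^n a^n matched against a^n, the step count explodes even though the match ultimately succeeds.

```python
# Toy backtracking matcher. A pattern is assumed to be a list of
# (char, optional) tokens, so [('a', True)] means "a?".

def backtrack_match(pattern, text):
    """Return (matched, steps) using naive recursive backtracking."""
    steps = 0

    def go(pi, ti):
        nonlocal steps
        steps += 1
        if pi == len(pattern):
            return ti == len(text)
        ch, optional = pattern[pi]
        # Greedily try to consume the character first ...
        if ti < len(text) and text[ti] == ch and go(pi + 1, ti + 1):
            return True
        # ... then backtrack and try skipping it if it is optional.
        return optional and go(pi + 1, ti)

    return go(0, 0), steps

# Degenerate case a?^n a^n against a^n: matches, but only after the
# matcher has explored a number of paths exponential in n.
for n in (4, 8, 10):
    pattern = [('a', True)] * n + [('a', False)] * n
    matched, steps = backtrack_match(pattern, 'a' * n)
    print(n, matched, steps)
```

The match can only succeed by skipping every optional token, but consume-first backtracking tries that possibility last, so essentially the whole decision tree gets explored.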

And just to be clear, that is no more incompatible with whatever optimizations you might add (like exact string search) than backtracking is.

Sure. But the question is: will construction time + FBM + verification be faster for a BNFA (backtracking NFA) than for a DFA? And will the DFA consume radically more memory than the BNFA? My position is that the DFA will most likely win only on degenerate patterns. The rest of the time my feeling is that the BNFA will win hands down, mostly because of how cheap construction is.

It's great that Perl 5.10 is going to have a hot-swappable regular expression engine, because then maybe someone could build one that handles the true-regular-expression notation in guaranteed linear time and then Perl programmers could ask for it if they wanted a guarantee of predictable behavior.

Yes, this is exactly why I was keen on making the regex engine pluggable. We have seen one proof of concept for it, but it's not well tested. It would be nice to see someone do a plug-in that uses a different algorithm.

Even better, that would pave a way to having Perl check for the non-regular operators, and if they weren't there, choose the linear-time implementation automatically.

I agree, it would be nice to automatically swap in a better algorithm under some circumstances, assuming you did so only when the matching semantics were equivalent to those provided by leftmost-longest.
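The semantic difference at stake shows up with even a short alternation. Python's re engine backtracks leftmost-first like Perl's, so it can stand in for the Perl side here:

```python
import re

# Leftmost-first (Perl-style) semantics: in an alternation the earlier
# branch wins even when a later branch could match more text. A POSIX
# leftmost-longest engine would return "ab" for this same pattern.
m = re.match(r'a|ab', 'ab')
print(m.group())  # leftmost-first gives "a"
```

So an automatic swap to a leftmost-longest implementation would silently change the answer for patterns like this one.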

Thanks for writing the article. Regardless of the debate over backtracking NFAs versus more DFA-like structures, I think it was well written and quite informative.

---
$world=~s/war/peace/g

Re^3: Perl regexp matching is slow??
by rsc (Acolyte) on Feb 02, 2007 at 13:17 UTC
    Then it would be plenty safe to take a regexp from the user and use it to run a search. But with the current Perl regexes, no way.

    Doesn't this criticism apply equally to Thompson's algorithm or to DFA construction? I would have thought the only difference would be that in a DFA you'd see performance issues with compilation and not execution.
    It depends on the implementation. If you use Thompson's algorithm as implemented in nfa.c from the article, then you end up with a run time that is O(m*n) for a length-m regexp and a length-n text.

    It is true that if you pre-compile the DFA, then you end up with O(2^m * n) time, but you can build the DFA on the fly and you're back to O(m*n), just with a smaller constant than the nfa.c version.
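A hedged sketch of the set-of-states simulation with on-the-fly DFA construction being described (in Python, using an assumed toy pattern representation of `(char, optional)` tokens, where `[('a', True)]` means "a?"; real engines handle far more): each input character advances a whole set of pattern positions in lockstep, and computed transitions are cached, so the worst case stays O(m*n) while repeated state sets hit the cache.

```python
def thompson_match(pattern, text):
    m = len(pattern)

    def closure(states):
        # Epsilon-closure: an optional token may be skipped for free.
        stack, out = list(states), set(states)
        while stack:
            pi = stack.pop()
            if pi < m and pattern[pi][1] and pi + 1 not in out:
                out.add(pi + 1)
                stack.append(pi + 1)
        return frozenset(out)

    dfa = {}  # (state set, char) -> next state set, built on the fly
    states = closure({0})
    for ch in text:
        key = (states, ch)
        if key not in dfa:
            # Advance every live position past a matching character,
            # then take the closure; memoize the resulting transition.
            dfa[key] = closure({pi + 1 for pi in states
                                if pi < m and pattern[pi][0] == ch})
        states = dfa[key]
        if not states:
            return False
    return m in states

# The degenerate pattern a?^n a^n against a^n is handled in time
# proportional to n, since at most m positions are live at any step.
pattern = [('a', True)] * 10 + [('a', False)] * 10
print(thompson_match(pattern, 'a' * 10))
```

Note there is no separate "compile the DFA" pass: the only up-front work is building the token list, and DFA states appear lazily as the input demands them.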
    And just to be clear, that is no more incompatible with whatever optimizations you might add (like exact string search) than backtracking is.

    Sure. But the question is: will construction time + FBM + verification be faster for a BNFA (backtracking NFA) than for a DFA? And will the DFA consume radically more memory than the BNFA? My position is that the DFA will most likely win only on degenerate patterns. The rest of the time my feeling is that the BNFA will win hands down, mostly because of how cheap construction is.
    This is a myth. Like most myths, it is based on a truth that is no longer true. When people precomputed the entire DFA before starting the search, as the original egrep did, construction was in fact pretty expensive. But no one does that anymore. They build out the DFA on the fly, as described in the article. If you build out the DFA on the fly, then the "compilation" or "construction" pass is more or less the same as for backtracking: basically you just have to syntax-check the regexp and build some internal representation of a parse tree. If you look at the big graph near the end of the article, you'll see that the nfa.c in the article has cheaper construction cost than all of the fancier implementations, by about a factor of 4. And I wasn't really trying.

    Construction costs are what you engineer them to be -- neither approach has any advantage here.

      If you look at the big graph near the end of the article, you'll see that the nfa.c in the article has cheaper construction cost than all of the fancier implementations, by about a factor of 4. And I wasn't really trying.

      Er, am I looking at the right thing? I don't see a graph covering construction time. I see a graph covering total run time, but not one covering construction alone.

      I'd be willing to look into hacking your code into the regex engine as an experiment. You'd have to release the code under the PAL, though.

      ---
      $world=~s/war/peace/g

        Yes, you're looking at the right graph. If you look at the left hand side, that's running a match on a single-byte string, so essentially all the time is in construction. Of course, I suppose that it's using a fairly small (a?a) regex, so it's not as illustrative as I had hoped.

        What I said before is still true -- you're only parsing the regex, which is all that you do for a backtracking implementation too.

        I'd be perfectly happy to rerelease the code under an appropriate license, but I doubt you'd want it -- it's nowhere near parsing Perl's regexes. It doesn't even parse full egrep regexes. It's really only meant to be a proof of concept.