in reply to Re^2: Perl regexp matching is slow??
in thread Perl regexp matching is slow??

Then it would be plenty safe to take a regexp from the user and use it to run a search. But with the current Perl regexes, no way.

Doesn't this criticism apply equally to Thompson's algorithm or to DFA construction? I would have thought the only difference would be that with a DFA you'd see the performance issues in compilation rather than in execution.
It depends on the implementation. If you use Thompson's algorithm as implemented in nfa.c from the article, then you end up with a run-time that is O(m*n) for length-m regexp and length-n text.
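
To make the O(m*n) bound concrete, here is a toy, hand-built sketch of the idea (not the article's nfa.c): the matcher keeps a set of live NFA states and advances every one of them on each input byte, so each of the n bytes costs at most m state updates and no backtracking ever happens. The fixed pattern "ab*c", the bitmask state encoding, and the function names are purely illustrative assumptions.

    #include <stdio.h>

    /* Toy sketch of simulating all NFA states at once, for the fixed
     * pattern "ab*c".  Each input byte is checked against at most m live
     * states, so a length-n text costs O(m*n) work.  Illustrative only,
     * not the article's nfa.c. */

    enum { S0 = 1 << 0, S1 = 1 << 1, S2 = 1 << 2 };  /* S2 = accepting state */

    static int step(int states, int c)
    {
        int next = 0;
        if ((states & S0) && c == 'a') next |= S1;   /* start --a--> S1     */
        if ((states & S1) && c == 'b') next |= S1;   /* S1 --b--> S1 (loop) */
        if ((states & S1) && c == 'c') next |= S2;   /* S1 --c--> accept    */
        return next;
    }

    static int match(const char *s)
    {
        int states = S0;                 /* the set of currently live states */
        for (; *s; s++)
            states = step(states, (unsigned char)*s);
        return (states & S2) != 0;
    }

    int main(void)
    {
        printf("%d %d %d\n", match("abbbc"), match("ac"), match("abx"));  /* 1 1 0 */
        return 0;
    }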

It is true that if you pre-compile the entire DFA, construction can take O(2^m) time and space in the worst case, but you can build the DFA on the fly and you're back to O(m*n), just with a smaller constant than the nfa.c version.
And just to be clear, that is no more incompatible with whatever optimizations you might add (like exact string search) than backtracking is.

Sure. But the question is whether Construction time + FBM + Verification will be faster for a BNFA (backtracking NFA) than for a DFA. And will the DFA consume radically more memory than the BNFA? My position is that most likely the DFA will win only on degenerate patterns. The rest of the time my feeling is that the BNFA will win hands down, mostly because of how cheap construction is.
This is a myth. Like most myths it is based in a truth that is no longer true. When people precomputed the entire DFA before starting the search, like the original egrep did, construction was in fact pretty expensive. But no one does that anymore. They build out the DFA on the fly, as described in the article. If you build out the DFA on the fly, then the "compilation" or "construction" passes are more or less the same as for backtracking: basically you just have to syntax check the regexp and build some internal representation of a parse tree. If you look at the big graph near the end of the article, you'll see that the nfa.c in the article has cheaper construction cost than all of the fancier implementations, by about a factor of 4. And I wasn't really trying.
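
As a rough illustration of what "build the DFA on the fly" means (again a hypothetical sketch built on the toy NFA above, not the article's dfa.c): each DFA state is just a cached set of NFA states plus a lazily filled transition table, so the only up-front construction work is parsing the pattern.

    #include <stdlib.h>

    /* On-the-fly DFA over the toy bitmask NFA sketched earlier (reuses S0,
     * S2 and step()).  A transition is computed the first time a character
     * is seen in a given state and then cached, so repeated input falls
     * onto the O(n) DFA path while the worst case stays O(m*n). */

    typedef struct DState DState;
    struct DState {
        int set;             /* bitmask of NFA states this DFA state stands for */
        DState *next[256];   /* filled in lazily as characters are seen         */
    };

    static DState *cache[64];
    static int ncache;

    static DState *dstate(int set)
    {
        int i;
        for (i = 0; i < ncache; i++)       /* reuse an existing DFA state... */
            if (cache[i]->set == set)
                return cache[i];
        DState *d = calloc(1, sizeof *d);  /* ...or create it on demand      */
        d->set = set;
        cache[ncache++] = d;
        return d;
    }

    static int dmatch(const char *s)
    {
        DState *d = dstate(S0);
        for (; *s; s++) {
            int c = (unsigned char)*s;
            if (d->next[c] == NULL)        /* transition not computed yet    */
                d->next[c] = dstate(step(d->set, c));
            d = d->next[c];
        }
        return (d->set & S2) != 0;
    }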

Construction costs are what you engineer them to be -- neither approach has any advantage here.

Re^4: Perl regexp matching is slow??
by demerphq (Chancellor) on Feb 02, 2007 at 14:06 UTC

    If you look at the big graph near the end of the article, you'll see that the nfa.c in the article has cheaper construction cost than all of the fancier implementations, by about a factor of 4. And I wasn't really trying.

    Er, am I looking at the right thing? I don't see a graph covering construction time. I see a graph covering total run time, but not one covering construction alone.

    I'd be willing to look into hacking your code into the regex engine as an experiment. You'd have to release the code under the PAL though.

    ---
    $world=~s/war/peace/g

      Yes, you're looking at the right graph. If you look at the left hand side, that's running a match on a single-byte string, so essentially all the time is in construction. Of course, I suppose that it's using a fairly small (a?a) regex, so it's not as illustrative as I had hoped.

      What I said before is still true -- you're only parsing the regex, which is all that you do for a backtracking implementation too.

      I'd be perfectly happy to rerelease the code under an appropriate license, but I doubt you'd want it -- it's nowhere near parsing Perl's regexes. It doesn't even parse full egrep regexes. It's really only meant to be a proof of concept.

        Of course, I suppose that it's using a fairly small (a?a) regex, so it's not as illustrative as I had hoped.

        Well, the scaling issue I was thinking of would be for /a{10000}/ and things like it. You said it's not necessary to unroll such a construct, but I'm not really seeing how.

        I'd be perfectly happy to rerelease the code under an appropriate license, but I doubt you'd want it

        Thanks, that's really great. I've already started work on beefing it up some to handle the Perl syntax, and I've completed some changes that turn it into a floating matcher, which I include below.

        Regarding whether we want it, of course we do. :-) For our purposes of trying to shoehorn this stuff into the existing engine, it's easier to start with a minimal proof of concept than a full-blown engine. Somebody who wants that should do it as a plug-in engine replacement. Me, I'm more concerned with making the out-of-the-box engine better, and therefore a less polished and simpler framework is in many respects easier to deal with.

        I'd really like to be able to swap it in on the fly. Although TimToady pointed something out that gives me some pause, which is that normal Perl alternation does not match leftmost-longest: the leftmost alternative that matches wins. Which effectively means that without extra syntax (which itself would mean changes to the normal engine as well) we can't swap it in when there is alternation in place. And making the backtracking engine do longest-token would mean trying every alternative, so adding syntax like (?|...|...) for longest-token alternation would mean that if Thompson's algorithm was /not/ swapped in, the resulting regex would be slower. (Of course, if we could detect that the DFA and NFA would match different things then this wouldn't be an issue, so maybe that's the right approach.)

        Anyway, I'm still thinking of how to tackle Perl's mixed utf8/byte-sequence encoding in a reasonable way. Probably by converting high-bit non-utf8 data to utf8 and then treating the utf8 as a byte stream. This would also facilitate case-insensitive matches. And BTW, unless I have misread it, the Plan 9 engine didn't seem to have very comprehensive Unicode matching support; I wouldn't use it as a basis for discussing Unicode regex matching, as it doesn't seem to support any of the hairy cases. Unicode matching is /not/ as simple as saying "ok, we have a huge character set now"; it also means dealing with combining characters and rules like those required for characters like the German sharp s or the Greek sigma.
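
        As a tiny sketch of the "convert to utf8 and treat it as bytes" idea (the function name and the Latin-1 assumption are mine, purely for illustration): a high-bit byte becomes a fixed two-byte sequence, and a byte-oriented matcher then sees a multi-byte character as nothing more than a short literal byte sequence in both pattern and text.

            /* Hypothetical helper: re-encode a Latin-1 (high-bit, non-utf8) byte
             * as utf8 so the matcher can stay purely byte-oriented. */
            static int latin1_to_utf8(unsigned char b, unsigned char out[2])
            {
                if (b < 0x80) { out[0] = b; return 1; }   /* ASCII: unchanged   */
                out[0] = 0xC0 | (b >> 6);                 /* 110xxxxx lead byte */
                out[1] = 0x80 | (b & 0x3F);               /* 10xxxxxx tail byte */
                return 2;
            }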

        Anyway, here is a modified match routine for dfa1.c that does floating matches. Note this requires extending the Dstate struct to know that it has accepted.
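
        The same floating-match idea can be sketched on the toy bitmask NFA from earlier (an illustration only, not the posted dfa1.c routine): re-seed the start state at every position so a match may begin anywhere, and check for acceptance after every byte so the scan can stop at the first match end. The "accepted" flag on the Dstate struct plays the same role as the (states & S2) test here.

            /* Illustrative floating (unanchored) matcher over the toy bitmask NFA:
             * S0 is ORed back in after every byte so a new match attempt can start
             * at any position, and acceptance is tested as soon as it can occur. */
            static int floating_match(const char *s)
            {
                int states = S0;
                if (states & S2)                    /* empty match at position 0 */
                    return 1;
                for (; *s; s++) {
                    states = step(states, (unsigned char)*s) | S0;
                    if (states & S2)                /* a match ended here        */
                        return 1;
                }
                return 0;
            }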

        Oh, and a little nit: you can't portably use 'char' to index an array of 256 elements. On some platforms char is signed, so you need to explicitly use unsigned char when doing such indexing.
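
        In code, the nit looks like this (a trivial sketch; the table and function names are made up):

            /* Plain char may be signed, so a byte like 0xE9 would become a negative
             * index.  Casting to unsigned char keeps the index in 0..255. */
            static int lookup(const int table[256], const char *s)
            {
                return table[(unsigned char)*s];
            }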

        ---
        $world=~s/war/peace/g

        Yes, you're looking at the right graph. If you look at the left hand side, that's running a match on a single-byte string, so essentially all the time is in construction.

        Unless I missed something looking at the timing scripts (.tar.gz), you're comparing the run-time of entire programs -- e.g. perl/ruby/etc. programs -- versus your compiled C code. That doesn't seem like a useful comparison of regexp algorithms.

        Are you really measuring the load time of Perl, the compilation time of the perl regexp test program and its execution and comparing that to the run time of your NFA code that is just a couple hundred lines of C?

        The other test programs all seem to be interpreted or call sh to execute a command-line tool, whereas your README suggests that the comparable nfa test program is a compiled C executable being run directly.

        -xdg

        Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

        Please do release it (and I think we all understand what proof of concept code sometimes looks like, so don't be embarrassed). There's a good possibility someone else will take it and run with it.