in reply to Re: Perl regexp matching is slow??
in thread Perl regexp matching is slow??

Hmmm ... I read the paper and didn't see anything that resembled an attack. To me it felt more like:

"Hey gang, there was a fork in regex technology XX years ago and the fork we're on now has some pathological cases. This other fork doesn't have those pathological cases, but has these other shortcomings. Perhaps we can take the creamy caramel center of the NFA and wrap it in the chocolaty goodness of DFA and get something much tastier?"

I totally agree with your reasoning about why we don't need to go that route. On the other hand, since reading the paper, my gears have been a-grinding with attempts at finding a way to merge backtracking into DFA without consuming insane amounts of storage. It's quite an entertaining mental exercise, even though I've gotten nowhere yet. (I don't actually expect to get anywhere, but it's still fun to try...)

--roboticus

Replies are listed 'Best First'.
Re^3: Perl regexp matching is slow??
by demerphq (Chancellor) on Jan 31, 2007 at 12:45 UTC

    Well, let's see: the title, and the lack of any kind of analysis of how a backtracking NFA can mitigate its weaknesses.

    For instance, the pattern used for the performance analysis could be optimised to /a{n,2n}/, and if written by a human it probably would be. The optimised construct would not exhibit the poor performance that he graphed out. In the section marked "Real world regular expressions" he mentions that an NFA or DFA would do the opposite (unroll the expression), but he doesn't discuss the impact of such a translation. Unrolling such a construct would result in an unmanageably large construction: a DFA that matches a{1,30000} is going to need at minimum 30k states and transitions. That's big. Now, the same thing would be represented in a backtracking NFA like Perl's in, what, four regops; let's be generous and say that each regop is 10 words (it's more like 1 or 2), so we would need 40 words to represent that pattern in a backtracking NFA. A DFA or a Thompson NFA would need 30k states; even assuming each transition is only a word, that is still a /huge/ difference in memory footprint, and in construction time, since memory for those 30k nodes has to be allocated and initialised. My bet is that before either a DFA or a Thompson NFA had finished construction, the Perl backtracking NFA would have already finished the match.
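    To make the equivalence concrete, here is a minimal sketch using Python's `re` module as a stand-in for Perl's backtracking engine (the helper names `unrolled` and `compact` are illustrative, not from the paper): the paper's benchmark pattern of n optional a's followed by n mandatory a's accepts exactly the same strings as the hand-optimised /a{n,2n}/, but only the compact form stays cheap as n grows.

```python
import re

def unrolled(n):
    # The paper's benchmark shape: a?a?...a? (n times) followed by a^n.
    # In a backtracking engine this degrades badly as n grows.
    return "a?" * n + "a" * n

def compact(n):
    # The hand-optimised equivalent demerphq describes: a{n,2n}.
    return "a{%d,%d}" % (n, 2 * n)

# For small n the two patterns accept exactly the same strings of a's:
n = 5
for k in range(0, 2 * n + 2):
    s = "a" * k
    assert bool(re.fullmatch(unrolled(n), s)) == bool(re.fullmatch(compact(n), s))

# For large n, the compact form matches instantly, with a tiny compiled
# program rather than thousands of unrolled states:
big = 500
assert re.fullmatch(compact(big), "a" * (2 * big)) is not None
```

    The point of the sketch is the representation cost: the compact form compiles to a handful of bounded-repeat opcodes regardless of n, which is the regop-count argument above in miniature.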

    Another thing: the benchmarks showed only one side of things, the performance of an accepting match. How fast will the Thompson construction be when the match fails? Perl's regex engine will reject the match in fewer than N character comparisons, because it uses FBM (Fast Boyer-Moore) matching to locate the aⁿ segment. It's only when the pattern matches that it behaves poorly, and that can be mitigated by using a better construct, which hypothetically could be optimised automatically at the engine level.
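    The fail-fast idea can be sketched as a literal prefilter. This is an illustration of the technique, not Perl's actual code: the helper name is hypothetical, and Python's substring search uses a different algorithm than Perl's Fast Boyer-Moore, but the shape is the same: if a substring the pattern *must* contain is absent, reject without ever entering the regex engine.

```python
import re

def match_with_prefilter(required_literal, regex, text):
    # Sketch of the optimisation demerphq describes: a mandatory literal
    # is located first with a fast substring scan (Perl uses FBM; Python's
    # `in` uses a different linear-time algorithm). If the literal is
    # absent, the match is rejected with zero backtracking.
    if required_literal not in text:
        return None
    return regex.search(text)

# /a{20}b/ can only match where a literal "b" occurs, so a long all-a
# string is rejected by the prefilter alone:
rx = re.compile(r"a{20}b")
assert match_with_prefilter("b", rx, "a" * 100000) is None
assert match_with_prefilter("b", rx, "a" * 30 + "b") is not None
```

    In Perl itself this happens inside the engine's optimiser rather than in user code; the sketch just shows why a failing match can cost less than one pass over the string per literal.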

    So it's as if the paper takes the tone "this is much better than that" and then glosses over all the hard stuff, doesn't mention the negative tradeoffs of the "better" solution, gives no credit to the existing engines and the solutions they use, and makes no admission that there are optimisation techniques for working around the potential degenerate cases that might be encountered.

    Update: tweaked some wording that wasn't clear

    ---
    $world=~s/war/peace/g

      In my humble experience as a has-been article writer, editors often stretch the wording of article titles to increase the apparent controversy. :(

      I totally agree with you that the author was less than balanced in his coverage, but, again, I must admit that I've fallen prey to the same lack of diligence.

      Thanks for your clarity in posting these numerous insights and details of matching for us, demerphq. I don't claim to be an expert, but your analysis will help me understand the science of matching a little better. :D

      Don Wilde
      "There's more than one level to any answer."

        but your analysis will help me understand the science of matching a little better

        No, my analysis won't help with that; for the science of matching, read the original paper. My analysis is of the merits of using a backtracking NFA as the basis for a general-purpose matching engine for use in something like perl.

        My beef with the paper is purely with its criticism of Perl (and other platforms) for using a backtracking engine. I don't think it's a bad decision; I think it's a decision motivated by considerations beyond just match-time performance, and I felt that the article, given its title, should in the name of balance or fairness have at the very least discussed them. However, these aren't scientific issues, they are engineering issues. For the scientific stuff, read the academic literature. For the engineering stuff, read the source of as many implementations as you can get your hands on. :-)

        ---
        $world=~s/war/peace/g

      OK ... that makes sense. I didn't really pay any attention to the title, and I tend to give writers the benefit of the doubt...

      But I'm still enjoying my attempts to make a better candy bar! 8^)

      --roboticus