in reply to Re^2: Perl regexp matching is slow??
in thread Perl regexp matching is slow??

Well lets see, the title, and the lack of any kind of analysis of how a backtracking NFA can mitigate its weaknesses.

For instance the pattern used for the performance analysis could be optimised to /a{n,2n}/, and if written by a human probably would be. And the optimised construct would not exhibit the poor performance that he graphed out. In the discussion in the section marked "Real world regular expressions" he mentions that an NFA or DFA would do the opposite (unrolling the expression) but he doesnt discuss the impact of such a translation. Unrolling such a construct would result in an unmanagably large construction. A DFA that matches a{1,30000} is going to need at minum 30k states and transitions in it. Thats big. Now the same thing would be represented in a backtracking NFA like perls in what four regops, lets be generous and say that each regop is going to be 10 words (its more like 1 or 2 per regop), so we would need 40 words to represent that pattern in an backtracking-NFA. A DFA or a Thompsons NFA would 30k states, assuming each transition is only a word that is still a /huge/ difference in memory footprint, and in construction time. Memory for 30k nodes would have to be allocated blah blah. My bet is that before either a DFA or Thompsons NFA had finished construction the perl backtracking NFA would have already finished the match.

Another thing, the benchmarks showed only one side of things, the performance of an accepting match. How fast will thompson construction be when the match fails? Perls regex engine will reject the match in less than N characters comparisons because it will use FBM matching to locate the an segment. Its only when it matches that it behaves poorly, and that can be mitigated by using a better construct, which hypothetically can be optimised automatically at the engine level.

So its like the paper takes the tone "this is much better than that" and then glosses over all the hard stuff, doesnt mention the negative tradeoffs of the "better" solution, and gives no credit to the existing engines and the solutions they use, and makes no admission that there are optimisation techniques for working around the potential degenerate cases that might be encountered.

Update: tweaked some wording that wasnt clear

---
$world=~s/war/peace/g

Replies are listed 'Best First'.
Re^4: Perl regexp matching is slow??
by samizdat (Vicar) on Jan 31, 2007 at 13:30 UTC
    In my humble experiaence as a has-been article writer, editors often stretch the wording of article titles to increase the apparent controversy. :(

    I totally agree with you that the author was less than balanced in his coverage, but, again, I must admit that I've fallen prey to the same lack of diligence.

    Thanks for your clarity in posting these numerous insights and details of matching for us, demerphq. I don't claim to be an expert, but your analysis will help me understand the science of matching a little better. :D

    Don Wilde
    "There's more than one level to any answer."

      but your analysis will help me understand the science of matching a little better

      No, my analysis wont help with that, for the science of matching read the original paper. My analysis is that of the merits of using a backtracking NFA as the basis for a general purpose matching engine for use in something like perl.

      My beef with the paper is purely with the criticism of Perl (and other platforms) for using a backtracking engine. I dont think its a bad decision, I think its just a decision that is motivated by considerations beyond just match time performance, and I felt that the article, given its title, should have in the name of balance or fairness or whatever at the very least discussed them. However these arent scientific issues, they are engineering issues. For scienctific stuff read the academic literature. For engineering stuff read the source of as many implementations as you can get your hands on. :-)

      ---
      $world=~s/war/peace/g

        A lot of the challenge in programming is in choosing the right tools, as you know. It is true that to understand the underlying guts, studying academic (?) papers and source code is the right way to proceed. I should have said "help me understand how to use the science of matching".

        I'm building a C extension library that embeds LISP concepts (and more) in C, and, yes, Allen's "Anatomy of LISP" helps me optimize it, but Patrick Henry Winston's "LISP" showed me what features I needed to have and gave me the insights I needed to extend it in useful ways.

        It's often difficult to see past an author's bias in a technical paper. Having experts such as yourself comment and complement the original discourse adds immeasurably to the utility of the work, and that's what I was praising you for.

        Don Wilde
        "There's more than one level to any answer."
Re^4: Perl regexp matching is slow??
by roboticus (Chancellor) on Jan 31, 2007 at 13:03 UTC
    OK ... that makes sense. I didn't really pay any attention to the title, and I tend to give writers the benefit of the doubt...

    But I'm still enjoying my attempts to make a better candy bar! 8^)

    --roboticus