in reply to Perl regexp matching is slow??

I wrote the article. Whether you think it applies to Perl depends on some usually-unstated beliefs about regular expressions and how they should be used.

Regular expressions started out as a notation for describing the strings that they match: a(bb)*a says "first an a, then some number of bbs, then an a". They didn't say anything about how you had to take an input string and check whether it matched, and in fact multiple algorithms were used for doing so. As a matter of computation, regular expressions describe sets of strings that can be matched in linear time and constant space, using an algorithm that isn't obvious from looking at the regular expression notation.
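
To make that concrete, here is the example in Perl (the test strings are just my own illustrations):

    for my $s ("aa", "abba", "abbbba", "aba") {
        printf "%-8s %s\n", $s, $s =~ /^a(bb)*a$/ ? "matches" : "no match";
    }
    # aa and abba match (zero and one bb), abbbba matches (two bb),
    # but aba does not: a lone b is not a bb.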

Perl's regular expressions have come to be viewed as a notation for an algorithm that matches strings. Operators like backreferences started the trend long before Perl, but newer operators like recursion and, to a lesser extent, lookahead and lookbehind assertions have certainly continued the trend quite a ways beyond where it had been. These features are most easily defined in terms of the underlying matching, not in terms of the strings they describe. And of course, as everyone loves to say, Perl's regular expressions are not regular expressions anymore. (While this is true, a large fraction of the ones actually written are true regular expressions and could be treated as such.)

The big difference between these two is whether you think regular expressions should be treated as descriptive (here's the effect I want) or prescriptive (here's how I want you to do it). People who accept the existence of so-called pathological regular expressions implicitly treat regular expressions as prescriptive: "we gave you a lot of rope, so of course you can manage to hang yourself." I believe that until one crosses into the not-too-often-visited land of exotic Perl regular expression features like generalized assertions, recursion, and backreferences, it is best to treat regular expressions as descriptive. Then the regular expression engine can implement the expression efficiently no matter what you've typed. Continuing the strained analogy from before, "you can have all the rope you need, but the regex engine is going to do the knot-tying for you once you tell it what you want to do, so there is no way you can hang yourself."

In the prescriptive regex mindset, programmers have to micro-optimize their regular expressions, knowing how they are implemented, in order to avoid so-called pathological behavior: they must avoid potentially-expensive operations except when they are sure (and hopefully correct!) that "they won't be expensive in this particular case." In the descriptive regex mindset, there are no pathological regular expressions. If there is some required substring, then fine, do that as a further optimization. But the programmer can be guaranteed that running a regular expression will not be expensive, unless it uses advanced features like assertions, recursion, or backreferences. Right now * and ? are potentially-expensive features, and they needn't be.
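
To put a number on "potentially expensive", here is the pathological family from the article, in Perl. (A sketch: exact timings depend on your Perl version and its internal optimisations, but the growth is exponential.)

    use strict;
    use warnings;
    use Time::HiRes qw(time);

    # n optional a's, then n required a's, matched against exactly n a's.
    # It always matches, but a backtracking engine tries on the order of
    # 2**n combinations before finding the one where every a? matches
    # nothing.
    for my $n (20 .. 26) {
        my $str = 'a' x $n;
        my $pat = ('a?' x $n) . ('a' x $n);
        my $t0  = time;
        $str =~ /^$pat$/;
        printf "n=%2d  %6.2fs\n", $n, time - $t0;
    }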

If the Perl programmer is the one writing the regular expressions, then maybe there isn't much benefit in being freed from learning how to "optimize" regular expressions. After all, one has already had to learn Perl; learning how to write regular expressions that Perl implements efficiently is not much more work. However, if the Perl programmer isn't the one writing the regular expressions, then there is a lot of benefit in knowing that a regular expression search isn't going to run for years and years.

As an example, consider a CGI script that accepts regular expressions as search keys. The programmer needs to execute those searches, and it would be very useful if Perl could be relied upon to run them all efficiently, or at least to flag the ones that might not run efficiently because they have things like recursion. Perhaps there could be a regex flag that says "run this in linear time, and if you can't, don't run it at all". Then it would be plenty safe to take a regexp from the user and use it to run a search. But with the current Perl regexes, no way. Lest you think the CGI example is contrived, consider Jeffrey Friedl's Japanese-English dictionary, which accepts regular expressions as search terms. The regexp lookup is really slow already (maybe the host machines are overloaded), but I bet it would be very, very slow if you gave it a regular expression like ((((.+)+)+)+)+(aaa|bbb). (Cutting off the search isn't quite correct here, since there are some entries that do match aaa, at least if you are searching for English.)
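
Absent such a flag, about the best the CGI author can do today is refuse patterns that appear to use the non-regular features before compiling them. A rough sketch (the feature test is my own heuristic, not a complete or watertight list):

    use strict;
    use warnings;
    use CGI;

    # Heuristic gate: it is fooled by escapes like \( and rejects some
    # harmless patterns; a real gate would parse the regexp properly.
    sub probably_regular {
        my ($pat) = @_;
        return 0 if $pat =~ /\\[1-9]/;         # backreferences \1 .. \9
        return 0 if $pat =~ /\(\?[<=!>{RP&]/;  # lookaround, recursion, code
        return 1;
    }

    my $cgi = CGI->new;
    my $q   = $cgi->param('q');
    if (defined $q && probably_regular($q)) {
        my $re = eval { qr/$q/ } or die "bad pattern\n";
        # ... run the search with $re ...
    }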

It's great that Perl 5.10 is going to have a hot-swappable regular expression engine, because then maybe someone could build one that handles the true-regular-expression notation in guaranteed linear time and then Perl programmers could ask for it if they wanted a guarantee of predictable behavior. Even better, that would pave the way to having Perl check for the non-regular operators and, if they weren't there, choose the linear-time implementation automatically.

In short, the article is not about making Perl's regex matching 100x faster across all programs. It is about making it more predictable and consistent across all programs. And just to be clear, that is no more incompatible with whatever optimizations you might add (like exact string search) than backtracking is.

Re^2: Perl regexp matching is slow??
by TimToady (Parson) on Jan 31, 2007 at 20:54 UTC
    Thanks. You'll get some pushback from the curmudgeons, but by and large I think we're a healthy enough community that we're actually trying to learn from our mistakes, at least every now and then. Certainly that applies to regex syntax; I think it's somewhat ironic that, just as the rest of the world is adopting Perl 5 regex syntax, Perl 6 is abandoning it. :-)

    Anyway, I'd be very interested in any suggestions you might have regarding the hybrid approach we're taking in Synopsis 5, especially the implicit longest-token semantics.

      It certainly sounds like distinguishing | and || is a good idea. I do not claim to fully understand everything in Synopsis 5, but the general idea seems to be on the right track.

      Here is one interesting gotcha (I can't tell whether it is handled correctly or not). Consider the Perl 5 regular expression (.*)\1 run on (("a" x $n) . "b") x 2. It should match. However, figuring that out requires trying many possible matches for the (.*) part: the first try would be the whole string, then the whole string minus one char, and so on, until you get down to just half the string.
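
      A quick demonstration of the work involved (the match succeeds, but only after the engine shrinks the first group one character at a time, down to half the string):

          my $n   = 10;
          my $str = (("a" x $n) . "b") x 2;    # "aaaaaaaaaab" doubled
          if ($str =~ /(.*)\1/) {
              printf "matched; \$1 is %d chars, half of %d\n",
                     length $1, length $str;
          }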

      It is not clear to me that the longest-token semantics would get this right: it sounds like they would grab the whole string during (.*), fail to match in the equivalent of \1, and then give up.

      All this is just to say that backreferences are even harder than they first appear, not that the longest-token idea is flawed.

      Going by my experience with the trie optimisation, the only difference between leftmost-longest and longest-token will be that you don't need to sort the possible accepting pathways through an alternation when cycling through them to find the one that allows the following pattern to succeed. Assuming you use a stack to track all the possible accepting pathways, then in the case of longest-token they will already be sorted, so you just keep popping them off the stack until one matches. The trie code, however, has to process them in the order they occur in the alternation, which in the Perl 5.9.5 implementation involves a linear probe through the list to find the best match each time. (The assumption is that the list will most often be very small.)
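
      A toy way to picture that difference (entirely my own sketch, nothing like the real 5.9.5 code): given the branches of an alternation that matched at some point, the two semantics differ only in the order in which you hand candidates to the rest of the pattern.

          # Each candidate: [position in alternation, matched length].
          # $rest is a hypothetical "does the rest of the pattern match
          # from here?" callback.
          sub pick {
              my ($order, $cands, $rest) = @_;
              for my $c (sort $order @$cands) {
                  return $c if $rest->($c);
              }
              return undef;
          }
          my $longest_token = sub { $b->[1] <=> $a->[1] };  # pre-sorted stack
          my $leftmost      = sub { $a->[0] <=> $b->[0] };  # the linear probe
          # pick($longest_token, \@cands, $rest) vs pick($leftmost, ...)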

      ---
      $world=~s/war/peace/g

Re^2: Perl regexp matching is slow??
by demerphq (Chancellor) on Feb 01, 2007 at 08:08 UTC

    The big difference between these two is whether you think regular expressions should be treated as descriptive (here's the effect I want) or prescriptive (here's how I want you to do it).

    Well, both are at play in the Perl engine. If it were fully prescriptive then the engine wouldn't be able to do things like fixed-string optimisations.

    Then it would be plenty safe to take a regexp from the user and use it to run a search. But with the current Perl regexes, no way.

    Doesn't this criticism apply equally to Thompson's algorithm or to DFA construction? I would have thought the only difference would be that in a DFA you'd see performance issues with compilation and not execution.

    And just to be clear, that is no more incompatible with whatever optimizations you might add (like exact string search) than backtracking is.

    Sure. But the question is: will construction time + FBM + verification be faster for a BNFA (backtracking NFA) than for a DFA? And will the DFA consume radically more memory than the BNFA? And my position is that most likely the DFA will win only on degenerate patterns. The rest of the time my feeling is that the BNFA will win hands down, mostly because of how cheap construction is.

    It's great that Perl 5.10 is going to have a hot-swappable regular expression engine, because then maybe someone could build one that handles the true-regular-expression notation in guaranteed linear time and then Perl programmers could ask for it if they wanted a guarantee of predictable behavior.

    Yes, this is exactly why I was keen on making the regex engine pluggable. We have seen one proof of concept for it, but it's not well tested. It would be nice to see someone do a plug-in that uses a different algorithm.

    Even better, that would pave the way to having Perl check for the non-regular operators and, if they weren't there, choose the linear-time implementation automatically.

    I agree, it would be nice to automatically swap in a better algorithm under some circumstances, assuming you did so only when the matching semantics were equivalent to those provided by leftmost-longest.

    Thanks for writing the article. Regardless of the debate over backtracking NFAs versus more DFA-like structures, I think it was well written and quite informative.

    ---
    $world=~s/war/peace/g

      Then it would be plenty safe to take a regexp from the user and use it to run a search. But with the current Perl regexes, no way.

      Doesn't this criticism apply equally to Thompson's algorithm or to DFA construction? I would have thought the only difference would be that in a DFA you'd see performance issues with compilation and not execution.

      It depends on the implementation. If you use Thompson's algorithm as implemented in nfa.c from the article, then you end up with a run time that is O(m*n) for a length-m regexp and a length-n text.

      It is true that if you pre-compile the entire DFA, then you can spend O(2^m) time and space on construction, but you can build the DFA on the fly and you're back to O(m*n) overall, just with a smaller constant than the nfa.c version.
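
      To make the O(m*n) claim concrete, here is a toy Thompson-style simulation in Perl for a tiny anchored subset (literal characters, '.', postfix '*'). It is my own sketch, not nfa.c, but it shows why there is no blowup: the matcher tracks a set of live pattern positions, never a call stack.

          sub match_linear {
              my ($pat, $text) = @_;
              my @tok;                              # [char-or-dot, starred?]
              for (my $i = 0; $i < length $pat; $i++) {
                  my $star = substr($pat, $i + 1, 1) eq '*' ? 1 : 0;
                  push @tok, [substr($pat, $i, 1), $star];
                  $i++ if $star;
              }
              my $close = sub {                     # epsilon closure: a
                  my %seen;                         # starred token may be
                  my @work = @_;                    # skipped entirely
                  while (@work) {
                      my $s = shift @work;
                      next if $seen{$s}++;
                      push @work, $s + 1 if $s < @tok && $tok[$s][1];
                  }
                  return keys %seen;
              };
              my %live = map { $_ => 1 } $close->(0);
              for my $ch (split //, $text) {
                  my @next;
                  for my $s (keys %live) {
                      next if $s >= @tok;
                      my ($c, $star) = @{ $tok[$s] };
                      push @next, $close->($star ? $s : $s + 1)
                          if $c eq '.' || $c eq $ch;
                  }
                  %live = map { $_ => 1 } @next;
                  return 0 unless %live;
              }
              return exists $live{ scalar @tok };   # ran off the pattern end
          }

          # Linear here, while a backtracker re-tries the a*'s against
          # each other before giving up:
          print match_linear('a*a*a*a*b', 'aaaaaaaaaaaaaaaaaaac') ? "y\n" : "n\n";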

      And just to be clear, that is no more incompatible with whatever optimizations you might add (like exact string search) than backtracking is.

      Sure. But the question is: will construction time + FBM + verification be faster for a BNFA (backtracking NFA) than for a DFA? And will the DFA consume radically more memory than the BNFA? And my position is that most likely the DFA will win only on degenerate patterns. The rest of the time my feeling is that the BNFA will win hands down, mostly because of how cheap construction is.

      This is a myth. Like most myths, it is based on a truth that is no longer true. When people precomputed the entire DFA before starting the search, like the original egrep did, construction was in fact pretty expensive. But no one does that anymore. They build out the DFA on the fly, as described in the article. If you build out the DFA on the fly, then the "compilation" or "construction" passes are more or less the same as for backtracking: basically you just have to syntax-check the regexp and build some internal representation of a parse tree. If you look at the big graph near the end of the article, you'll see that the nfa.c in the article has cheaper construction cost than all of the fancier implementations, by about a factor of 4. And I wasn't really trying.
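
      (In the live-set sketch a few paragraphs up, "building the DFA on the fly" is nothing more than memoising the inner loop. compute_step below is a hypothetical refactoring of that loop; each distinct live set becomes a DFA state only when some input actually reaches it.)

          my %dfa;
          sub step_cached {
              my ($live_key, $ch) = @_;   # e.g. "0,2,5" and "a"
              $dfa{"$live_key|$ch"} //= compute_step($live_key, $ch);
          }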

      Construction costs are what you engineer them to be -- neither approach has any advantage here.

        If you look at the big graph near the end of the article, you'll see that the nfa.c in the article has cheaper construction cost than all of the fancier implementations, by about a factor of 4. And I wasn't really trying.

        Er, am I looking at the right thing? I don't see a graph covering construction time; I see a graph covering total run time, but not one covering construction alone.

        I'd be willing to look into hacking your code into the regex engine as an experiment. You'd have to release the code under the PAL though.

        ---
        $world=~s/war/peace/g

Re^2: Perl regexp matching is slow??
by Anonymous Monk on Feb 27, 2007 at 08:35 UTC
    In order to satisfy that "descriptive" ideal you would have to build a DFA, if for no other reason than to make all |-s equal and to deduce non-greedy multipliers. Just to remind you: once you cross the Rubicon to "descriptive", non-greedy intent has to be deduced.

    So, at the very least, you would incur a perf hit all the time for the sake of a fringe case. It doesn't really matter if you hide it behind on-the-fly building: if the DFA has an exponential explosion of states, you will eventually reach the point of explosion. Sorry, I can't find the example that Ullman used to throw at unsuspecting souls :-), but I'm sure you'll find enough of these on the net ...
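
    For reference, the textbook blowup example being reached for here is probably the "n-th symbol from the end" language:

        my $n   = 20;
        my $pat = '(a|b)*a(a|b){' . ($n - 1) . '}';
        # "the n-th symbol from the end is an a": any DFA for this must
        # remember the last $n symbols, so it needs about 2**$n states.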

    You might get away with stuff like an "on the fly" DFA for something that is a one-time run, but in that case the whole debate is artificial and academic.

    For a real-world run, say I have a regex compiled all the way into the binary, as in .NET, and am paying the price of never being able to "unload" it for the lifetime of the binary, and the binary happens to be a server required to hit 99.9999% availability while running that regexp on all incoming text ... I think you get the picture :-) That is what non-greedy multipliers and all those assertions you dislike so much are for, and why even the thought that someone might be building a DFA on the fly is not a realistic option. Even if it could handle back-references, or more importantly a recursive parse with an implicit stack, which it can't.

    See, we actually need the interpretative, ahem, interpretation :-) of the regex, one that is stable across languages and platforms; hence PCRE. And the only real "price" we "pay" is that we use non-greedy multipliers, and we are done. Most horrible thing, isn't it :-)

    Now we could debate whether non-greedy should have been the default, but that's a historical note :-) It is also a matter of how precisely one wants to define the input one will take in. Some people may insist on using negative classes instead, but they don't write regexps to match a in a :-).
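
    A tiny illustration of that choice (my own example):

        my $s = '<b>bold</b>';
        my ($lazy)  = $s =~ /<(.+?)>/;      # non-greedy: $1 is 'b'
        my ($class) = $s =~ /<([^>]+)>/;    # negative class: $1 is 'b'
        # Same result, but the class version states "no > inside" in the
        # pattern itself and never backtracks to discover it.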

    As for back-refs and recursive parsing, those are going to be more in demand, not less. For example, the .NET implementation of Perl regexes was designed to be able to parse XML without a perf hit, and while I can't say whether Perl would take a perf hit for something like that, it can definitely parse XML with regexes much more easily than .NET can.

    So, you can go cursing everyone, starting with Thompson himself :-), or you can come to terms with the fact that regex engines in widespread use are practical tools and not academic toys. Claiming that an engine is "slow" because it has a problem with a ridiculous thing like a?{3}a{3} on "aaa" but runs a{0,30}?a{30} without a hiccup looks like the ultimate exercise in futility. If for some reason you really, honestly need to match something like that, well, you can always write a regex to transform "a?{n}a{n}" into "a{0,n}?a{n}" :-)) Might need back-refs to handle more convoluted cases though :-)))))

    In the meantime I will happily cherish Perl/PCRE/.NET etc. regexes, in all their universal presence and the benefits they bring, for all people with real needs for fast string parsing.

      As for back-refs and recursive parsing, those are going to be more in demand, not less. For example, the .NET implementation of Perl regexes was designed to be able to parse XML without a perf hit, and while I can't say whether Perl would take a perf hit for something like that, it can definitely parse XML with regexes much more easily than .NET can.

      I'm not sure what performance hit you mean. Although ambrus had some neat ideas for something like the super-linear cache with regard to pattern-level recursion that I have yet to follow up on.

      Also, juerd has published a validation regex for XML that is derived from the EBNF that the W3C has posted. The story that you can't parse XML or HTML with regexes is going to go up in a puff of smoke when Perl 5.10 comes out.
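
      For flavour, here is the sort of 5.10 pattern recursion that makes this plausible. A toy balanced-bracket matcher of my own, not juerd's actual XML grammar:

          my $re = qr/^( \( (?: [^()]++ | (?1) )* \) )$/x;   # (?1) recurses
          print "balanced\n" if '(a(b)(c(d)))' =~ $re;       # into group 1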

      ---
      $world=~s/war/peace/g

        The problem is that it's still tricky to get at the parse tree. All the parts are there -- named captures, relative backreferences, recursion -- but you still have to litter your regexp with hairy use of
        (?{ local $FOO{blah} = $^N })
        to get the data back.
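
        For instance (my own toy, not a recommendation):

            # Note the push is not undone if the engine backtracks past
            # it -- which is exactly why the local-based dance above
            # exists for real grammars.
            our @words;
            'foo bar baz' =~ /^ (?: (\w+) (?{ push @words, $^N }) \s* )+ $/x;
            print "@words\n";    # foo bar baz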

        It should be possible to whip something up with re::engine and Regexp::Parser, but a brief attempt showed me that this is somewhat tricky to get right except in simple cases. It would be much more convenient to have this support built into the regex engine. Perhaps for 5.12...