flyerhawk has asked for the wisdom of the Perl Monks concerning the following question:

I have a quick question for the gurus. I am looking to parse a bunch of logs, and I have a fairly large regex I am going to use to filter lines out. Am I better off running a single evaluation with one large, complex regex, or would I be better off iterating through the various expressions one at a time? I assume the single evaluation would be better, but I wanted to make sure. The expressions themselves are stored in a file and pulled into the script.

Replies are listed 'Best First'.
Re: Multiple Regex evaluations or one big one?
by Tanktalus (Canon) on Jul 27, 2011 at 21:18 UTC

    Have you tried Benchmark to see? You might be surprised.

    Even aside from performance issues (which a single regex may or may not resolve), there is the maintenance issue. Is it easier to write, debug, add to, remove from, or generally maintain the code when it's a single regex, or as a series of regexes? Usually, a bunch of small regexes are easier to deal with than a massive one, though if you're dynamically putting the regexes together, that may not be an issue.
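
    A minimal sketch of the kind of comparison Benchmark makes easy (the filter terms and sample lines below are made up; substitute your real expressions and log data):

        use strict;
        use warnings;
        use Benchmark qw(cmpthese);

        # Hypothetical filter terms and sample input
        my @terms    = qw(ERROR WARN timeout refused);
        my @patterns = map { qr/$_/ } @terms;
        my $combined = do { my $alt = join '|', @terms; qr/$alt/ };
        my @lines    = ("2011-07-27 12:00:01 kernel: connection refused") x 1_000;

        cmpthese(-2, {
            one_big    => sub { my $hits = grep { /$combined/ } @lines },
            many_small => sub {
                my $hits = 0;
                LINE: for my $line (@lines) {
                    for my $re (@patterns) {
                        ++$hits, next LINE if $line =~ $re;
                    }
                }
            },
        });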

      Thanks for the info. So I guess it really isn't a big deal either way. I will probably just iterate over each expression then. I think the code will look a little better that way.
Re: Multiple Regex evaluations or one big one?
by davido (Cardinal) on Jul 27, 2011 at 21:46 UTC

    "Am I better running a..." has to be qualified with what your design goal is.

    Regarding computational efficiency: alternation is pretty efficient nowadays. But if there's a possibility of lots of backtracking, and that possibility can be avoided by breaking things up into smaller expressions, smaller expressions are the way to go.

    If the goal is readability, smaller expressions are often the way to go, though to some degree that's mitigated by the /x modifier and judicious use of whitespace.
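
    For example (a made-up pattern, just to show the difference /x makes):

        my $dense  = qr/^(\d{4}-\d{2}-\d{2})\s+(ERROR|WARN)\s+(.*)$/;

        my $spaced = qr/
            ^ ( \d{4}-\d{2}-\d{2} )     # date stamp
            \s+ ( ERROR | WARN )        # severity we care about
            \s+ ( .* ) $                # rest of the message
        /x;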

    If the goal is easy debugging, I don't think there's much question that easier-to-digest chunks of regular expressions are going to be simpler to debug than a wall of regexp metamumbojumbo. (That even applies to people who have invested considerable time and effort in learning REs inside and out, as well as reading MRE, in my opinion.)


    Dave

      If there's any backtracking, you have lost the first battle already.

      The ultimate win when optimizing regular expressions is to keep the regular expression engine idle. To quote some Perl core hacker (Yves? Jarkko? Nicholas?), "Perl's regular expression engine isn't fast. It often *looks* fast, but that's because the optimizer does such a good job". Splitting the pattern up into smaller parts increases the chance the regular expression engine can just stay in bed - and that the optimizer does all the heavy lifting.
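
      One way to illustrate the point (a hypothetical fragment; @lines and the pattern are assumptions, not from the OP's code): let a cheap literal check reject most lines before the regex engine is ever invoked.

          my @errors;
          for my $line (@lines) {
              # Cheap literal scan first; most lines never reach the regex engine
              next if index($line, 'ERROR') < 0;
              push @errors, $1 if $line =~ /\bERROR\b.*\bcode=(\d+)/;
          }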

Re: Multiple Regex evaluations or one big one?
by Marshall (Canon) on Jul 27, 2011 at 22:43 UTC
    I agree with the comment that benchmarking is very important. There have been changes as of late in the regex engine, and some "old conventional wisdom" may not hold anymore. Performance is, in general, release-dependent (and not always faster with later releases); speed depends upon the exact situation.

    But from previous work that I've done, if you have a bunch of terms that are Or'd together (X|Y), the regex engine will do this more efficiently if it can see them all, rather than you running a separate regex for X, then Y.

    One module that I have works on "regex piece parts". Each small bit is tested separately, then Perl builds a humongous regex with all of them Or'd together. That regex gets dynamically compiled and used. For development, I can work on one of the pieces and regression test it before getting the rest of the regex zoo involved.
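
    Something along these lines (a sketch of the idea, not the actual module; the fragments are invented):

        # Each small piece can be developed and regression-tested on its own
        my @pieces = (
            qr/login \s+ failed/x,
            qr/connection \s+ reset \s+ by \s+ peer/x,
            qr/disk \s+ full/x,
        );

        # OR them together into one dynamically compiled regex
        my $big = do {
            my $alt = join '|', map { "(?:$_)" } @pieces;
            qr/$alt/;
        };

        print "hit\n" if 'Oct 3 host1: disk full on /var' =~ $big;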

    The ability of Perl to dynamically create a regex and use it is something that can't be done in C#, Java, etc. Sometimes this can work out very well. I have one piece of code that uses substr + some regex stuff + some program logic to write simple somewhat overlapping Or terms to search for specific things. This has helped me in some situations where I'm trying to match "sort of like" XYZ.

    Anyway, consider the possibility of program-generated dynamic regexes. As Larry Wall says, "programs that write programs, are the happiest programs of all".

    Update:
    I didn't give a clear-cut example of a dynamic regex, so here's one that is close to a real-world situation (it's a big simplification of actual code): let's say that I am trying to find the word ABCD, but according to the matching rules, I am going to allow one of the letters to be wrong; for example, AXCD matches. Now let's say that, furthermore, I will allow a single pair of letters to be transposed (counts as one combined error). It is easy to algorithmically generate the combos: ABCD .BCD A.CD AB.D ABC. BACD ACBD ...etc. If I use a program to generate this long sequence of Or'd terms, then when the first letter is not an A, the regex engine will immediately rule out ABCD A.CD AB.D... etc. The regex engine builds a state machine that is pretty sophisticated, and it will execute quickly even if there are 30 terms in the "dumb" regex. If somebody here knows how to write a general regex that runs as quickly, or actually if you can even do it at all with one general regex, I'd like to hear about it! The regex should be able to look for words with 3, 4, 5, or 6 letters. My regex kung fu is not up to that job.
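
    Roughly like this (my sketch of the generation step described above, not the code from the actual project):

        # Build the "one error allowed" alternation for a word: the word itself,
        # each position wildcarded, and each adjacent pair swapped
        sub fuzzy_re {
            my ($word)  = @_;
            my @letters = split //, $word;
            my @terms   = ($word);

            for my $i (0 .. $#letters) {             # one letter wrong
                my @c = @letters;
                $c[$i] = '.';
                push @terms, join '', @c;
            }
            for my $i (0 .. $#letters - 1) {         # one pair transposed
                my @c = @letters;
                @c[$i, $i + 1] = @c[$i + 1, $i];
                push @terms, join '', @c;
            }

            my $alt = join '|', @terms;
            return qr/\b(?:$alt)\b/;
        }

        my $re = fuzzy_re('ABCD');
        print "ok\n" for grep { /$re/ } qw(ABCD AXCD BACD);   # all three match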

Re: Multiple Regex evaluations or one big one?
by ~~David~~ (Hermit) on Jul 27, 2011 at 21:46 UTC
    I find that working with small regexes is easier. In situations like this, I store each regex as a hash key, and then after successfully matching one of the regexes, I delete the key so I don't have to iterate over it again.
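
    Roughly like this (a sketch with invented patterns; @log_lines and the pattern names are assumptions, not the poster's actual code):

        my %patterns = (
            start   => qr/service \s+ starting/x,
            stop    => qr/service \s+ stopped/x,
            restart => qr/service \s+ restarted/x,
        );

        for my $line (@log_lines) {
            for my $name (keys %patterns) {
                next unless $line =~ $patterns{$name};
                print "first '$name' line: $line\n";
                delete $patterns{$name};   # matched once, stop testing it
                last;
            }
            last unless %patterns;         # every pattern has matched
        }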
Re: Multiple Regex evaluations or one big one?
by Anonymous Monk on Jul 28, 2011 at 08:17 UTC