flyerhawk has asked for the wisdom of the Perl Monks concerning the following question:

I have a quick question for the gurus. I am looking to parse a bunch of logs, and I have a fairly large regex I am going to use to filter lines out. Am I better off running a single evaluation with one large, complex regex, or would I be better off iterating through the various expressions one at a time? I assume the single evaluation would be better, but I wanted to make sure. The expressions themselves are stored in a file and pulled into the script.

Replies are listed 'Best First'.
Re: Multiple Regex evaluations or one big one?
by Tanktalus (Canon) on Jul 27, 2011 at 21:18 UTC

    Have you tried Benchmark to see? You might be surprised.

    Even aside from performance issues (which a single regex may or may not resolve), there is the maintenance issue. Is it easier to write, debug, add to, remove from, or generally maintain the code when it's a single regex, or as a series of regexes? Usually, a bunch of small regexes are easier to deal with than a massive one, though if you're dynamically putting the regexes together, that may not be an issue.
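
    A minimal sketch of the kind of comparison Benchmark makes easy (the filter terms and sample lines below are made up; substitute your real expressions and log data):

        use strict;
        use warnings;
        use Benchmark qw(cmpthese);

        # Hypothetical filter terms and sample input
        my @terms    = qw(ERROR WARN timeout refused);
        my @patterns = map { qr/$_/ } @terms;
        my $combined = do { my $alt = join '|', @terms; qr/$alt/ };
        my @lines    = ("2011-07-27 12:00:01 kernel: connection refused") x 1_000;

        cmpthese(-2, {
            one_big    => sub { my $hits = grep { /$combined/ } @lines },
            many_small => sub {
                my $hits = 0;
                LINE: for my $line (@lines) {
                    for my $re (@patterns) {
                        ++$hits, next LINE if $line =~ $re;
                    }
                }
            },
        });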

      Thanks for the info. So I guess it really isn't a big deal either way. I will probably just iterate over each expression then. I think the code will look a little better that way.
Re: Multiple Regex evaluations or one big one?
by davido (Cardinal) on Jul 27, 2011 at 21:46 UTC

    "Am I better running a..." has to be qualified with what your design goal is.

    Regarding computational efficiency: alternation is pretty efficient nowadays. But if there's a possibility of lots of backtracking, and that possibility can be avoided by breaking things up into smaller expressions, smaller expressions are the way to go.

    If the goal is readability, smaller expressions are often the way to go, though to some degree that's mitigated by the /x modifier and judicious use of whitespace.
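
    For example (a made-up pattern, just to show the difference /x makes):

        my $dense  = qr/^(\d{4}-\d{2}-\d{2})\s+(ERROR|WARN)\s+(.*)$/;

        my $spaced = qr/
            ^ ( \d{4}-\d{2}-\d{2} )     # date stamp
            \s+ ( ERROR | WARN )        # severity we care about
            \s+ ( .* ) $                # rest of the message
        /x;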

    If the goal is easy debugging, I don't think there's much question that easier-to-digest chunks of regular expressions are going to be simpler to debug than a wall of regexp metamumbojumbo. (That even applies to people who have invested considerable time and effort in learning REs inside and out, as well as reading MRE, in my opinion.)


    Dave

      If there's any backtracking, you have lost the first battle already.

      The ultimate win when optimizing regular expressions is to keep the regular expression engine idle. To quote some Perl core hacker (Yves? Jarkko? Nicholas?), "Perl's regular expression engine isn't fast. It often *looks* fast, but that's because the optimizer does such a good job". Splitting the pattern up into smaller parts increases the chance the regular expression engine can just stay in bed - and that the optimizer does all the heavy lifting.
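
      One way to illustrate the point (a hypothetical fragment; @lines and the pattern are assumptions, not from the OP's code): let a cheap literal check reject most lines before the regex engine is ever invoked.

          my @errors;
          for my $line (@lines) {
              # Cheap literal scan first; most lines never reach the regex engine
              next if index($line, 'ERROR') < 0;
              push @errors, $1 if $line =~ /\bERROR\b.*\bcode=(\d+)/;
          }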

Re: Multiple Regex evaluations or one big one?
by Marshall (Canon) on Jul 27, 2011 at 22:43 UTC
    I agree with the comment that benchmarking is very important. There have been changes as of late in the regex engine, and some "old conventional wisdom" may not hold anymore. Performance is, in general, release-dependent (and not always faster with later releases); speed depends upon the exact situation.

    But from previous work that I've done, if you have a bunch of terms that are Or'd together (X|Y), the regex engine will do this more efficiently if it can see them all, rather than you running a separate regex for X, then Y.

    One module that I have works on "regex piece parts". Each small bit is tested separately, then Perl builds a humongous regex with all of them Or'd together. That regex gets dynamically compiled and used. For development, I can work on one of the pieces and regression test it before getting the rest of the regex zoo involved.
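
    Something along these lines (a sketch of the idea, not the actual module; the fragments are invented):

        # Each small piece can be developed and regression-tested on its own
        my @pieces = (
            qr/login \s+ failed/x,
            qr/connection \s+ reset \s+ by \s+ peer/x,
            qr/disk \s+ full/x,
        );

        # OR them together into one dynamically compiled regex
        my $big = do {
            my $alt = join '|', map { "(?:$_)" } @pieces;
            qr/$alt/;
        };

        print "hit\n" if 'Oct 3 host1: disk full on /var' =~ $big;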

    The ability of Perl to dynamically create a regex and use it is something that can't be done in C#, Java, etc. Sometimes this can work out very well. I have one piece of code that uses substr + some regex stuff + some program logic to write simple somewhat overlapping Or terms to search for specific things. This has helped me in some situations where I'm trying to match "sort of like" XYZ.

    Anyway, consider the possibility of program-generated dynamic regexes. As Larry Wall says, "programs that write programs, are the happiest programs of all".

    Update:
    I didn't give a clear-cut example of a dynamic regex, so here's one that is close to a real-world situation (it's a big simplification of actual code): let's say that I am trying to find the word ABCD, but according to the matching rules, I am going to allow one of the letters to be wrong; for example, AXCD matches. Now let's say that, furthermore, I will allow a single pair of letters to be transposed (counts as one combined error). It is easy to algorithmically generate the combos: ABCD .BCD A.CD AB.D ABC. BACD ACBD ...etc. If I use a program to generate this long sequence of Or'd terms, then when the first letter is not an A, the regex engine will immediately rule out ABCD A.CD AB.D... etc. The regex engine builds a state machine that is pretty sophisticated, and it will execute quickly even if there are 30 terms in the "dumb" regex. If somebody here knows how to write a general regex that runs as quickly, or actually if you can even do it at all with one general regex, I'd like to hear about it! The regex should be able to look for words with 3, 4, 5, or 6 letters. My regex kung fu is not up to that job.
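
    Roughly like this (my sketch of the generation step described above, not the code from the actual project):

        # Build the "one error allowed" alternation for a word: the word itself,
        # each position wildcarded, and each adjacent pair swapped
        sub fuzzy_re {
            my ($word)  = @_;
            my @letters = split //, $word;
            my @terms   = ($word);

            for my $i (0 .. $#letters) {             # one letter wrong
                my @c = @letters;
                $c[$i] = '.';
                push @terms, join '', @c;
            }
            for my $i (0 .. $#letters - 1) {         # one pair transposed
                my @c = @letters;
                @c[$i, $i + 1] = @c[$i + 1, $i];
                push @terms, join '', @c;
            }

            my $alt = join '|', @terms;
            return qr/\b(?:$alt)\b/;
        }

        my $re = fuzzy_re('ABCD');
        print "ok\n" for grep { /$re/ } qw(ABCD AXCD BACD);   # all three match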

Re: Multiple Regex evaluations or one big one?
by ~~David~~ (Hermit) on Jul 27, 2011 at 21:46 UTC
    I find that working with small regexes is easier. In situations like this, I store each regex as a hash key, and then after successfully matching one of the regexes, I delete the key so I don't have to iterate over it again.
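
    Roughly like this (a sketch with invented patterns; @log_lines and the pattern names are assumptions, not the poster's actual code):

        my %patterns = (
            start   => qr/service \s+ starting/x,
            stop    => qr/service \s+ stopped/x,
            restart => qr/service \s+ restarted/x,
        );

        for my $line (@log_lines) {
            for my $name (keys %patterns) {
                next unless $line =~ $patterns{$name};
                print "first '$name' line: $line\n";
                delete $patterns{$name};   # matched once, stop testing it
                last;
            }
            last unless %patterns;         # every pattern has matched
        }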
Re: Multiple Regex evaluations or one big one?
by Anonymous Monk on Jul 28, 2011 at 08:17 UTC