in reply to Regex runs for too long

Maybe I should explain what I'm trying to do. Basically, the script accepts a multi-line search pattern as input. It then checks whether the pattern appears in the log. If so, it returns the segment of the log that matched the pattern. Otherwise, it returns nothing. The output gets parsed downstream, and specific pieces of data are pulled out (hostname, phone number, problem report, etc.) and sent to a pager.

I had originally implemented this as a pair of nested foreach loops, similar to what masem suggested. The outer loop checked one line of the log at a time, and when the line matched the top line of the pattern, the inner loop would check the rest of the lines in the pattern against the next few lines in the log.

The problem with this was that for large logs, it would take a long time to complete and use a lot of CPU in the process. I realized that 5000 lines of log data times 20 lines per search pattern times 20 search patterns means analyzing 2,000,000 log file lines at a time. I was hoping that taking advantage of perl's regex engine could help cut this back. For most data, the regex method is orders of magnitude faster. Where the foreach method would take a full minute to parse through several thousand lines of text, the regex method typically takes under a second to zip through tens of thousands of lines. However, there are a few combinations of search patterns and log segments that seem to bog it down.

I am pretty new to writing regular expressions, in general. I had tried using .*? at one point, but I've been experimenting a lot in the process of troubleshooting. I think the reason I was using  /(?:.(?!foo))*/ as opposed to  /.*?foo/ was that I didn't want to match the line delimiter. I guess that doesn't matter, though, since I s// it out later, anyway. In any case, the regex engine gets bogged down with certain data either way.

It may be that I should go back to using while and foreach loops and find other ways to optimize the process, but I was hoping there was something profoundly broken (and therefore fixable) about the way I'm using regex.

examine

Replies are listed 'Best First'.
Re (tilly) 2: Regex runs for too long
by tilly (Archbishop) on Jan 24, 2002 at 10:22 UTC
    You might be able to get some useful ideas by looking at how RE (tilly) 4: SAS log scanner works and applying that to your matching problem.

    The basic reference for understanding performance issues with REs is Mastering Regular Expressions by Friedl. The key is that the way that Perl's RE engine works is that it does a recursive search through ways of partially matching your pattern to the string, every time it fails backing up to the last choice it had and trying again. The problem is that the tree that it has to search can easily be exponential in the size of the string you are trying to match. The pattern /(foo)(.*)*\1\2/ on the string "foo and lots of garbage here..." is a canonical example. Of course Perl works hard to recognize the simple examples of this issue and special cases them so that won't be slow.

    The general answer is therefore to write your RE in such a way that the RE makes as few choices as possible, and so that where it does make choices, it makes them as late as possible so that you spend less time backtracking then redoing the same reasoning as possible. In this specific case the best you can do is watch the RE match, and as it goes, trace on pen and paper exactly what choices it is making where. You already know that the RE engine is backtracking, but trying to reason along with it should show you why it is backtracking. (The hard part of the exercise is that you will tend to leap to general conclusions that you know are right, and miss that the computer is going to grind away missing the obvious at great length...)

Re: Re: Regex runs for too long
by Juerd (Abbot) on Jan 24, 2002 at 04:32 UTC
    First, try a simple regex or index to see if the line could be what you want. Skip the rest if it isn't.
    Then, try a more complex (but not as complex as the one you already use) regex to see if you were right. If not, it's not too late to skip.
    Then, try to parse the data (split it), and validate it to check if the previous checks were right.

    If more lines are valid, use less checks, if less lines are valid, use more checks. It's all about knowing your data. (If all lines are valid, use no checks :)

    while (<>) { next unless /first check/; next unless /second one/; my @foo = split /delimiter/; next if @foo != 5; # or whatever ... } <p><font color=green><code> 2;0 juerd@ouranos:~$ perl -e'undef christmas' Segmentation fault 2;139 juerd@ouranos:~$