examine has asked for the wisdom of the Perl Monks concerning the following question:

When I execute the following code with certain data, it takes a long, long time (~10 seconds on a fairly fast machine) to return.

if ($text =~ /($searchString)/) { $_ = $1; s/!CR!/\n/g; print "$_\n"; } else { print "No match\n"; }

These are the values of $searchString and $text going into the regular expression. I apologize for the length, but this is as much as I've been able to narrow down the problem.

$searchString : (?:.(?!!CR!))*.?-------(?:.(?!!CR!))*.?!CR!(?:.(?!!CR! +))*.?!CR!(?:.(?!!CR!))*.?!CR!(?:.(?!!CR!))*.?node(?:.(?!!CR!))*.?!CR! +(?:.(?!!CR!))*.?id(?:.(?!!CR!))*.?NEW ALERT(?:.(?!!CR!))*.?!CR!(?:.(? +!!CR!))*.?borg(?:.(?!!CR!))*.?!CR!(?:.(?!!CR!))*.?subsys(?:.(?!!CR!)) +*.?!CR!(?:.(?!!CR!))*.?severity(?:.(?!!CR!))*.?!CR!(?:.(?!!CR!))*.?!C +R!(?:.(?!!CR!))*.?nohup(?:.(?!!CR!))*.?!CR!(?:.(?!!CR!))*.?Id(?:.(?!! +CR!))*.?!CR!(?:.(?!!CR!))*.?Team(?:.(?!!CR!))*.?!CR!(?:.(?!!CR!))*.?! +CR!(?:.(?!!CR!))*.?!CR!(?:.(?!!CR!))*.?REPORT(?:.(?!!CR!))*.?Resource +(?:.(?!!CR!))*.?slow -(?:.(?!!CR!))*.? $text : ------------------------------------------------------------!C +R!!CR!11/19/2001 16:20:23!CR!node_name[sanitized]!CR!id[987987] NEW A +LERT!CR!borg[this.also.sanitized]!CR!subsystem[this.too] ERR-5555!CR! +severity[warning] group[sanitized-host]!CR!!CR!nohup /usr/local/bin/n +otreal!CR!Id: 987987!CR!Team[TEST_Services]!CR!!CR!ACTION: Send Voice + Mail to [555-3423 555-9922]!CR!REPORT: Process syslog has crashed " +garbage garbage!CR!-------------------------------------------------- +----------!CR!!CR!

I have been looking at the output generated by use re 'debug'; but without much success. I can see that it seems to loop over the same string over and over, but I'm not sure why.

I would enormously appreciate any insight.

Examine

Replies are listed 'Best First'.
Re: Regex runs for too long
by Masem (Monsignor) on Jan 23, 2002 at 23:40 UTC
    Do you *really* want to have a regex like that? I think the regex is the wrong solution. It's probably better to use split to get out the chucks of data and then decide what's important, for example:
    while ( my $line = <FILE> ) { my @items = split "!CR!", $line; if ( expected ( @items ) ) { print join "\n", @items; } } # This would be need to be done to taste sub expected { my @data = @_; my $matched = 0; # Say you need to make sure the id line is there... foreach my $item ( @data ) { if ( $item =~ /^id\[\d+\]/ ) { $matched++; last } } return $matched; }
    This solution I would suspect, regardless of how you write the matching routing, will be faster than the work you're trying to put on the regex engine, and this way would be much more understandable to other coders that might need it. If you need something more specific, tell us what you need to pull out from this text area and we can help more if needed.

    -----------------------------------------------------
    Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain
    "I can see my house from here!"
    It's not what you know, but knowing how to find it if you don't know that's important

Re: Regex runs for too long
by japhy (Canon) on Jan 23, 2002 at 23:52 UTC
    I agree with Masem split() is the way to go here.

    Still, as a general regex efficiency principle, you can rewrite /(?:.(?!foo))*/ as /.*?foo/. The regex engine won't go character-by-character; it'll be smart about it and jump from one "f" to the next until it finds an "f" followed by "oo". Or you could use /[^f]*(?:f(?!oo)[^f]*)*/, although I'd personally go with the simpler .*? approach.

    _____________________________________________________
    Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Re: Regex runs for too long
by examine (Initiate) on Jan 24, 2002 at 01:45 UTC

    Maybe I should explain what I'm trying to do. Basically, the script accepts a multi-line search pattern as input. It then checks whether the pattern appears in the log. If so, it returns the segment of the log that matched the pattern. Otherwise, it returns nothing. The output gets parsed downstream, and specific pieces of data are pulled out (hostname, phone number, problem report, etc.) and sent to a pager.

    I had originally implemented this as a pair of nested foreach loops, similar to what masem suggested. The outer loop checked one line of the log at a time, and when the line matched the top line of the pattern, the inner loop would check the rest of the lines in the pattern against the next few lines in the log.

    The problem with this was that for large logs, it would take a long time to complete and use a lot of CPU in the process. I realized that 5000 lines of log data times 20 lines per search pattern times 20 search patterns means analyzing 2,000,000 log file lines at a time. I was hoping that taking advantage of perl's regex engine could help cut this back. For most data, the regex method is orders of magnitude faster. Where the foreach method would take a full minute to parse through several thousand lines of text, the regex method typically takes under a second to zip through tens of thousands of lines. However, there are a few combinations of search patterns and log segments that seem to bog it down.

    I am pretty new to writing regular expressions, in general. I had tried using .*? at one point, but I've been experimenting a lot in the process of troubleshooting. I think the reason I was using  /(?:.(?!foo))*/ as opposed to  /.*?foo/ was that I didn't want to match the line delimiter. I guess that doesn't matter, though, since I s// it out later, anyway. In any case, the regex engine gets bogged down with certain data either way.

    It may be that I should go back to using while and foreach loops and find other ways to optimize the process, but I was hoping there was something profoundly broken (and therefore fixable) about the way I'm using regex.

    examine
      You might be able to get some useful ideas by looking at how RE (tilly) 4: SAS log scanner works and applying that to your matching problem.

      The basic reference for understanding performance issues with REs is Mastering Regular Expressions by Friedl. The key is that the way that Perl's RE engine works is that it does a recursive search through ways of partially matching your pattern to the string, every time it fails backing up to the last choice it had and trying again. The problem is that the tree that it has to search can easily be exponential in the size of the string you are trying to match. The pattern /(foo)(.*)*\1\2/ on the string "foo and lots of garbage here..." is a canonical example. Of course Perl works hard to recognize the simple examples of this issue and special cases them so that won't be slow.

      The general answer is therefore to write your RE in such a way that the RE makes as few choices as possible, and so that where it does make choices, it makes them as late as possible so that you spend less time backtracking then redoing the same reasoning as possible. In this specific case the best you can do is watch the RE match, and as it goes, trace on pen and paper exactly what choices it is making where. You already know that the RE engine is backtracking, but trying to reason along with it should show you why it is backtracking. (The hard part of the exercise is that you will tend to leap to general conclusions that you know are right, and miss that the computer is going to grind away missing the obvious at great length...)

      First, try a simple regex or index to see if the line could be what you want. Skip the rest if it isn't.
      Then, try a more complex (but not as complex as the one you already use) regex to see if you were right. If not, it's not too late to skip.
      Then, try to parse the data (split it), and validate it to check if the previous checks were right.

      If more lines are valid, use less checks, if less lines are valid, use more checks. It's all about knowing your data. (If all lines are valid, use no checks :)

      while (<>) { next unless /first check/; next unless /second one/; my @foo = split /delimiter/; next if @foo != 5; # or whatever ... } <p><font color=green><code> 2;0 juerd@ouranos:~$ perl -e'undef christmas' Segmentation fault 2;139 juerd@ouranos:~$

Re: Regex runs for too long
by JayBonci (Curate) on Jan 24, 2002 at 15:52 UTC
    According to perlfunc, study may or may not help you out on lengthy regular expressions. I tried it once on a huge set of regex'es (piles and piles of text), and it did make it somewhat faster as far as I can tell. You may want to give that a shot both with and without, to see if it's possible to pick up any speed on it.

    It's not a great algorithmic trick like some of the shrewd monks before me have suggested, but it may be able to add to those solutions with it's pre-indexing. A more detailed explaination of it's organization can be found on the link above. Good luck !

    --jay