in reply to regexp performance on large logfiles

You show us neither the log data you're matching against nor the strings you're searching for. Since you say you're mostly looking for stuff at the end of the line, it might be worthwhile to reverse the string and look for the reversed word at the start of the string. See sexeger.

You are converting some glob patterns to regular expressions. Depending on how your glob patterns look, you can gain a lot by applying your domain knowledge. For example, you will likely know that all your strings are anchored to the end of the line. Also, if you store the compiled regular expressions instead of recompiling them every time from a string (keys %regexp), you will likely gain a bit of performance.
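A minimal sketch of what that might look like; glob_to_regex() stands in for whatever conversion your script already does, and the variable names are made up:

    my %regexp;
    for my $glob (@glob_patterns) {
        my $re = glob_to_regex($glob);   # hypothetical converter, stands in for the existing one
        $regexp{$glob} = qr/$re$/;       # compile once, anchored to the end of the line
    }

    while (my $line = <$log>) {
        for my $glob (keys %regexp) {
            if ($line =~ $regexp{$glob}) {
                # ... record the match ...
                last;                    # no need to try the remaining patterns
            }
        }
    }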

Another thing might be to build one large regular expression from your patterns, so the regex engine does the loop instead of Perl. See Regexp::Assemble for example, or Regexp::Trie (although that one shouldn't be necessary if you're using 5.10, whose regex engine already applies a trie optimisation to alternations).
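Roughly like this, assuming the same hypothetical glob_to_regex() converter and end-of-line anchoring as above, and guessing that matched lines are the ones you want to skip:

    use Regexp::Assemble;

    my $ra = Regexp::Assemble->new;
    $ra->add( glob_to_regex($_) . '$' ) for @glob_patterns;
    my $big_re = $ra->re;                # one combined regular expression

    while (my $line = <$log>) {
        next if $line =~ $big_re;        # matches one of the patterns; skip it
        print $line;
    }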

Also consider that IO might well be a limiting factor while trying to read the file. Storing your logfile compressed and then spawning gzip -cd $logfile| might or might not improve the situation, depending on whether disk/network IO is limiting you or not.
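If you want to try it, a pipe open along these lines would do (just a sketch, assuming $logfile holds the name of the compressed file):

    open my $log, '-|', 'gzip', '-cd', $logfile
        or die "Cannot spawn gzip for $logfile: $!";

    while (my $line = <$log>) {
        # ... match against the patterns as before ...
    }
    close $log;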

In your code, you do

for (...) { next if $do_not_print;

You can stop iterating through that loop altogether by using last instead of next once you set $do_not_print to 1.
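A rough sketch of that change; the loop body is a guess at your structure:

    # instead of skipping the remaining patterns one by one ...
    for my $pat (@patterns) {
        next if $do_not_print;
        $do_not_print = 1 if $line =~ $pat;
    }

    # ... leave the loop as soon as the flag is set
    for my $pat (@patterns) {
        if ($line =~ $pat) {
            $do_not_print = 1;
            last;            # the remaining patterns cannot change the outcome
        }
    }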

Re^2: regexp performance on large logfiles
by snl_JYDawg (Initiate) on Aug 05, 2008 at 10:34 UTC

    I'm matching surf data (i.e. URLs) against a list of almost a thousand patterns in the first iteration.

    The data looks something like:

    1217930320 jydawg http://www.perlmonks.org/index.pl

    and a relevant pattern might look something like:

    *.perlmonks.org

    With the large amount of patterns I wonder if I would benefit from Regexp::Assemble or Regexp::Trie.

    I'm looking into the other two pieces of advice. I don't suppose sticking a gigabyte of text data in an array is a good idea :P. Well, maybe in chunks ...
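    For what it's worth, a sketch of the usual trade-off (the file handling here is made up): reading line by line keeps memory flat regardless of file size, whereas slurping pulls everything in at once.

        open my $log, '<', $logfile or die "Cannot open $logfile: $!";

        # constant memory: one line at a time
        while (my $line = <$log>) {
            # ... match $line against the patterns ...
        }

        # by contrast, this would load the whole file into memory:
        # my @lines = <$log>;
        close $log;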

      At least for those simplistic patterns, you'll likely do much better if you just do a substring search. The regex engine will also do the substring search for you, but if you don't have www.*.info as a valid pattern, using substr could be faster. Even if you have such patterns, you can possibly move them towards the end of the checks so that the rejection can happen earlier for most lines.

      Update: I meant index...
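      Something along these lines; the suffix list and the fallback are made up, just to show the shape of it:

          my @literal_suffixes = ('.perlmonks.org', '.example.com');

          LINE: while (my $line = <$log>) {
              for my $suffix (@literal_suffixes) {
                  if (index($line, $suffix) >= 0) {
                      # matched a literal pattern, no regex needed
                      next LINE;
                  }
              }
              # ... fall back to the few genuinely wildcarded patterns here ...
          }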

        Current runtime: 25.4 minutes, just by re-examining the patterns and using index and substr as much as possible.

        Do you mean index? That shouldn't be any faster than a compiled regexp. The problem the OP has is that he's compiling his regexps over and over and over and over again.