snl_JYDawg has asked for the wisdom of the Perl Monks concerning the following question:

Most revered Monks,

I've been given a list of shell regexps and a logfile of a few million lines. The matching lines need to be taken out. Basically something like:

grep -v -f list_regexps.txt logfile.log

The format of the logfile is something like:

number string1 string2

The regexps apply to the last string in the line. I also want to know how many times each regexp matches.

To achieve this I wrote something like this:

# taken from the cookbook:
sub glob2pat {
    my( $globstr ) = @_;
    my %patmap = (
        '*' => '.*',
        '?' => '.',
        '[' => '[',
        ']' => ']',
    );
    $globstr =~ s{(.)} { $patmap{$1} || "\Q$1" }ge;
    return '^' . $globstr . '$';
}

...

my %ignore = map { glob2pat( $_ ) => 0 } @list_regexps;

...

while(<FILE>) {
    chomp;
    my( @cols ) = split(" ", $_);
    my $do_not_print;
    ...
    foreach my $regexp ( keys %ignore ) {
        next if( $do_not_print );
        if ( $cols[0] =~ m/$regexp/ ) {
            $ignore{$regexp}++;
            $do_not_print++;
        }
    }
    next if( $do_not_print );
}

This works but is slow. I found out that a string compare is a lot faster than pattern matching, so I did the Perl equivalent of:

awk '{ print $NF }' logfile.log | sort | uniq -c | sort -nr | head -4000 | awk '{ print $NF }' >temp.dat

And applied my list of regexps to that, which looks like:

use File::Copy qw( move );

my %strcmp;
my %data;

# get the unique list of last-column strings
open( DF, "logfile.log" );
while(<DF>) {
    chomp;
    $data{ (split(" ",$_))[2] }++;
}
close(DF);

my @keys = sort { $data{$b} <=> $data{$a} } keys %data;
@keys = splice(@keys, 0, 4000);

open( OUT, ">temp.dat" );
foreach my $line ( @keys ) {
    print OUT "$line\n";
}
close( OUT );

# create the list of strings matching the patterns
# (%regexp holds the glob2pat patterns, built as in the first listing)
open( IN, "temp.dat" );
while(<IN>) {
    chomp();
    # match the strings against the patterns
    foreach my $regexp ( keys %regexp ) {
        $strcmp{$_} = $regexp{$regexp} if ( $_ =~ m/$regexp/ );
    }
}
close( IN );

# apply this to the logfile
open( IN, "logfile.log" );
open( OUT, ">logfile_parsed.log" );
while(<IN>) {
    chomp;
    my @cols = split(" ",$_);
    next if( exists $strcmp{$cols[2]} );
    print OUT "$_\n";
}
close( OUT );
close( IN );
move( "logfile_parsed.log", "logfile.log" );

# from here more or less the same as the first perl listing.

This significantly sped up the process (a 30% to 40% reduction in run time), mainly because it removed the highest-scoring strings. Currently a single run takes about 2.5 to 3 hours, and the data size is expected to double in the near future. So I'm looking for any performance gain I can get.

Replies are listed 'Best First'.
Re: regexp performance on large logfiles
by Corion (Patriarch) on Aug 05, 2008 at 09:00 UTC

    You show us neither the log data you're matching against nor the strings you're searching for. Since you say you're mostly looking for stuff at the end of the line, it might be worthwhile to reverse the string and look for the reversed word at the start of the string. See sexeger.
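
    A minimal sexeger sketch (the FILE/OUT handles, the field extraction and the sample suffix are only placeholders; it rejects lines whose last field ends in a given literal suffix):

        my $suffix     = '.perlmonks.org';          # literal tail to reject
        my $rev_suffix = reverse $suffix;           # scalar assignment: reverses the string
        my $rev_pat    = qr/^\Q$rev_suffix\E/;      # end-of-string anchor becomes a start anchor
        while ( my $line = <FILE> ) {
            my ($url) = $line =~ /(\S+)\s*$/;       # last whitespace-separated field
            next unless defined $url;
            my $rev_url = reverse $url;
            next if $rev_url =~ $rev_pat;           # matched: drop the line
            print OUT $line;
        }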

    You are converting some glob patterns to regular expressions. Depending on how your glob patterns look, you can gain lots by applying your domain knowledge. For example, you will likely know that all your strings are anchored to the end of the line. Also, if you store the compiled regular expressions instead of recompiling them every time from a string (keys %regexp), you likely gain a bit of performance.
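
    For example, a minimal sketch of precompiling (it reuses your glob2pat and @list_regexps; the two-hash layout is just one way to keep a counter next to each compiled regexp):

        my ( %compiled, %count );
        for my $glob ( @list_regexps ) {
            my $pat = glob2pat( $glob );
            $compiled{$glob} = qr/$pat/;            # compile once, up front
            $count{$glob}    = 0;
        }
        # later, in the per-line loop:
        #   for my $glob ( keys %compiled ) {
        #       if ( $cols[0] =~ $compiled{$glob} ) { $count{$glob}++; last }
        #   }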

    Another thing might be to build one large regular expression from your patterns, so the regex engine does the loop instead of Perl doing the loop. See Regexp::Assemble for example, or Regexp::Trie (although that one shouldn't be necessary if you're using 5.10).

    Also consider that IO might well be a limiting factor while trying to read the file. Storing your logfile compressed and then spawning gzip -cd $logfile| might or might not improve the situation, depending on whether disk/network IO is limiting you or not.
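
    For example, a sketch of reading through a gzip pipe (this assumes gzip is on the PATH and the log is stored as logfile.log.gz):

        open( my $fh, '-|', 'gzip', '-cd', 'logfile.log.gz' )
            or die "can't start gzip: $!";
        while ( my $line = <$fh> ) {
            # ... same per-line processing as before ...
        }
        close( $fh );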

    In your code, you do

    for (...) { next if $do_not_print;

    You can stop iterating through that loop by using last instead when you set $do_not_print to 1.
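
    Something along these lines, based on the loop above:

        my $do_not_print;
        foreach my $regexp ( keys %ignore ) {
            if ( $cols[0] =~ m/$regexp/ ) {
                $ignore{$regexp}++;
                $do_not_print = 1;
                last;                # no point testing the remaining patterns
            }
        }
        next if $do_not_print;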

      I'm matching surf data (i.e. URLs) against a list of almost a thousand patterns in the first iteration.

      The data looks something like:

      1217930320 jydawg http://www.perlmonks.org/index.pl

      and a relevant pattern might look something like:

      *.perlmonks.org

      With the large amount of patterns I wonder if I would benefit from Regexp::Assemble or Regexp::Trie.

      I'm looking into the other two pieces of advice. I don't suppose sticking a gigabyte of text data in an array is a good idea :P. Well, maybe in chunks ...

        At least for those simplistic patterns, you'll likely do much better if you just do a substring search. The regex engine will also do the substring search for you, but if you don't have www.*.info as a valid pattern, using substr could be faster. Even if you have such patterns, you can possibly move them towards the end of the checks so the rejection can happen earlier.

        Update: I meant index...
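
        A minimal index sketch (the literal tails are made up here; in practice they would come from stripping the leading * off your patterns, and IN/OUT are the usual filehandles):

            my @literals = ( '.perlmonks.org', '.example.com' );
            URL: while ( my $line = <IN> ) {
                my ($url) = $line =~ /(\S+)\s*$/;            # last field
                next URL unless defined $url;
                for my $lit ( @literals ) {
                    next URL if index( $url, $lit ) >= 0;    # substring hit: drop the line
                }
                print OUT $line;
            }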

Re: regexp performance on large logfiles
by perrin (Chancellor) on Aug 05, 2008 at 13:21 UTC
Re: regexp performance on large logfiles
by JavaFan (Canon) on Aug 05, 2008 at 10:38 UTC
    Considering that you seem to be using grep style patterns (which are much simpler than Perl patterns), have you checked whether using grep may be faster? You may even want to compare various implementations of "grep". Then again, if the patterns are simple, Perl might not use the regexp engine at all, and just let the optimizer do all the heavy work.
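
    One quick way to check on your own data is a small Benchmark sketch comparing a fixed-string regexp against index (the sample line and literal are taken from this thread):

        use Benchmark qw( cmpthese );

        my $line = '1217930320 jydawg http://www.perlmonks.org/index.pl';
        cmpthese( -3, {
            regexp => sub { my $hit = $line =~ /\.perlmonks\.org/ },
            index  => sub { my $hit = index( $line, '.perlmonks.org' ) >= 0 },
        } );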

      Nope, in some cases Perl is much faster, and this is one of them. I'm quite fond of ye olde shell scripting, but here Perl is/was the smarter choice.

Re: regexp performance on large logfiles
by ikegami (Patriarch) on Aug 05, 2008 at 17:01 UTC
    • some loop { for my $regexp ( regexps ) { /$regexp/ } }

      is a very bad pattern unless the regexps are precompiled. In your case, the regexps are not precompiled since you're using a hash key (a string) as the regexp.

      In the first program, you could even flatten all of %ignore into a single regexp if you didn't need to do $ignore{$regexp}++.

    • The performance of the regexp engine depends highly on the regexp in question, but you didn't show us the regexps.

    • Since you only care about one value returned by split, a regexp should be faster than split (see the sketch below).
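
    For example, a quick sketch of pulling out just one field without split:

        my ($first) = /^\s*(\S+)/;      # first whitespace-separated field
        my ($last)  = /(\S+)\s*$/;      # last field (the URL column in the OP's data)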

Re: regexp performance on large logfiles
by ikegami (Patriarch) on Aug 05, 2008 at 17:41 UTC

    Some improvements. Notes:

    • %ignore now holds compiled patterns.
    • Counts are now in %ignored.
    • glob2pat should be faster by processing multiple characters at once, but that shouldn't affect your run time.
    • Removed $do_not_print to reduce clutter. A label is used on the while loop instead.
    • In the second snippet, Regexp::Assemble is used to factor out the "^" and "$" that are common to all the ignore regexps.
    • The chomp is probably unnecessary.
    • If you only use $col[0] of @cols, replace
      my( @cols ) = split(" ", $_);
      with
      my( $col0 ) = /^\s*(\S+)/;
    {
        my %patmap = (
            '*' => '.*',
            '?' => '.',
            '[' => '[',
            ']' => ']',
        );
        # character class matching "normal" characters, i.e. anything that is
        # not a glob metacharacter (\Q protects the ']' inside the class)
        my ($norm) = map qr/[^\Q$_\E]/, join '', keys %patmap;

        sub glob2pat {
            my $globstr = @_ ? $_[0] : $_;
            $globstr =~ s{($norm+|.)}
                         { $patmap{$1} || "\Q$1" }sge;
            return '^' . $globstr . '$';
        }
    }

    my %ignore = map { ( $_ => qr/$_/ ) } map glob2pat, @list_regexps;
    my %ignored;

    LINE: while(<FILE>) {
        chomp;
        my( @cols ) = split(" ", $_);
        ...
        for my $re_name ( keys %ignore ) {
            my $re = $ignore{$re_name};
            if ( $cols[0] =~ $re ) {
                $ignored{$re_name}++;
                next LINE;
            }
        }
        ...
    }

    If you can live without %ignored,

    use Regexp::Assemble qw( );

    {
        my %patmap = (
            '*' => '.*',
            '?' => '.',
            '[' => '[',
            ']' => ']',
        );
        my ($norm) = map qr/[^\Q$_\E]/, join '', keys %patmap;

        sub glob2pat {
            my $globstr = @_ ? $_[0] : $_;
            $globstr =~ s{($norm+|.)}
                         { $patmap{$1} || "\Q$1" }sge;
            return '^' . $globstr . '$';
        }
    }

    # build one combined regexp from all the converted globs
    my $ignore_re = do {
        my $ra = Regexp::Assemble->new();
        $ra->add( glob2pat($_) ) for @list_regexps;
        $ra->re();
    };

    while(<FILE>) {
        chomp;
        my( @cols ) = split(" ", $_);
        ...
        next if $cols[0] =~ $ignore_re;
        ...
    }