snl_JYDawg has asked for the wisdom of the Perl Monks concerning the following question:

Most revered Monks,

I've been given a list of shell regexps and a logfile of a few million lines. The matching lines need to be taken out. Basically something like:

grep -v -f list_regexps.txt logfile.log

The format of the logfile is something like:

number string1 string2

The regexps apply to the last string in the line. I also want to know how many times each regexp matches.

To achieve this I wrote something like this:

# taken from the cookbook:
sub glob2pat {
    my( $globstr ) = @_;
    my %patmap = (
        '*' => '.*',
        '?' => '.',
        '[' => '[',
        ']' => ']',
    );
    $globstr =~ s{(.)} { $patmap{$1} || "\Q$1" }ge;
    return '^' . $globstr . '$';
}

...

my %ignore = map { glob2pat( $_ ) => 0 } @list_regexps;

...

while(<FILE>) {
    chomp;
    my( @cols ) = split(" ", $_);
    my $do_not_print;
    ...
    foreach my $regexp ( keys %ignore ) {
        next if( $do_not_print );
        if ( $cols[0] =~ m/$regexp/ ) {
            $ignore{$regexp}++;
            $do_not_print++;
        }
    }
    next if( $do_not_print );
}

This works but is slow. I found out that a string compare is a lot faster than pattern matching, so I did the Perl equivalent of:

awk '{ print $NF }' logfile.log | sort | uniq -c | sort -nr | head -4000 | awk '{ print $NF }' >temp.dat

And applied my list of regexps to that, which looks like:

use File::Copy qw( move );

my %strcmp;
my %data;

# get the unique list of last-column strings
open( DF, "logfile.log" );
while(<DF>) {
    chomp;
    $data{ (split(" ",$_))[2] }++;
}
close(DF);

my @keys = sort { $data{$b} <=> $data{$a} } keys %data;
@keys = splice(@keys, 0, 4000);

open( OUT, ">temp.dat" );
foreach my $line ( @keys ) {
    print OUT "$line\n";
}
close( OUT );

# create the list of strings matching the patterns
# (%regexp holds the glob2pat patterns, built as in the first listing)
open( IN, "temp.dat" );
while(<IN>) {
    chomp();
    # match the strings against the patterns
    foreach my $regexp ( keys %regexp ) {
        $strcmp{$_} = $regexp{$regexp} if ( $_ =~ m/$regexp/ );
    }
}
close( IN );

# apply this to the logfile
open( IN, "logfile.log" );
open( OUT, ">logfile_parsed.log" );
while(<IN>) {
    chomp;
    my @cols = split(" ",$_);
    next if( exists $strcmp{$cols[2]} );
    print OUT "$_\n";
}
close( OUT );
close( IN );
move( "logfile_parsed.log", "logfile.log" );

# from here more or less the same as the first perl listing.

This significantly sped up the process (a 30% to 40% reduction in run time), mainly because it removed the highest-scoring strings. Currently a single run takes about 2.5 to 3 hours, and the data size is expected to double in the near future. So I'm looking for any performance gain I can get.

Replies are listed 'Best First'.
Re: regexp performance on large logfiles
by Corion (Patriarch) on Aug 05, 2008 at 09:00 UTC

    You show us neither the log data you're matching against nor the strings you're searching for. Since you say you're mostly looking for stuff at the end of the line, it might be worthwhile to reverse the string and look for the reversed word at the start of the string. See sexeger.
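
    A minimal sexeger sketch (the FILE/OUT handles, the field extraction and the sample suffix are only placeholders; it rejects lines whose last field ends in a given literal suffix):

        my $suffix     = '.perlmonks.org';          # literal tail to reject
        my $rev_suffix = reverse $suffix;           # scalar assignment: reverses the string
        my $rev_pat    = qr/^\Q$rev_suffix\E/;      # end-of-string anchor becomes a start anchor
        while ( my $line = <FILE> ) {
            my ($url) = $line =~ /(\S+)\s*$/;       # last whitespace-separated field
            next unless defined $url;
            my $rev_url = reverse $url;
            next if $rev_url =~ $rev_pat;           # matched: drop the line
            print OUT $line;
        }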

    You are converting some glob patterns to regular expressions. Depending on how your glob patterns look, you can gain lots by applying your domain knowledge. For example, you will likely know that all your strings are anchored to the end of the line. Also, if you store the compiled regular expressions instead of recompiling them every time from a string (keys %regexp), you likely gain a bit of performance.
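
    For example, a minimal sketch of precompiling (it reuses your glob2pat and @list_regexps; the two-hash layout is just one way to keep a counter next to each compiled regexp):

        my ( %compiled, %count );
        for my $glob ( @list_regexps ) {
            my $pat = glob2pat( $glob );
            $compiled{$glob} = qr/$pat/;            # compile once, up front
            $count{$glob}    = 0;
        }
        # later, in the per-line loop:
        #   for my $glob ( keys %compiled ) {
        #       if ( $cols[0] =~ $compiled{$glob} ) { $count{$glob}++; last }
        #   }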

    Another thing might be to build one large regular expression from your patterns, so the regex engine does the loop instead of Perl doing the loop. See Regexp::Assemble for example, or Regexp::Trie (although that one shouldn't be necessary if you're using 5.10).

    Also consider that IO might well be a limiting factor while trying to read the file. Storing your logfile compressed and then spawning gzip -cd $logfile| might or might not improve the situation, depending on whether disk/network IO is limiting you or not.
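
    For example, a sketch of reading through a gzip pipe (this assumes gzip is on the PATH and the log is stored as logfile.log.gz):

        open( my $fh, '-|', 'gzip', '-cd', 'logfile.log.gz' )
            or die "can't start gzip: $!";
        while ( my $line = <$fh> ) {
            # ... same per-line processing as before ...
        }
        close( $fh );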

    In your code, you do

    for (...) { next if $do_not_print;

    You can stop iterating through that loop by using last instead when you set $do_not_print to 1.
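
    Something along these lines, based on the loop above:

        my $do_not_print;
        foreach my $regexp ( keys %ignore ) {
            if ( $cols[0] =~ m/$regexp/ ) {
                $ignore{$regexp}++;
                $do_not_print = 1;
                last;                # no point testing the remaining patterns
            }
        }
        next if $do_not_print;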

      I'm matching surf data (i.e. URLs) against a list of almost a thousand patterns in the first iteration.

      The data looks something like:

      1217930320 jydawg http://www.perlmonks.org/index.pl

      and a relevant pattern might look something like:

      *.perlmonks.org

      With the large amount of patterns I wonder if I would benefit from Regexp::Assemble or Regexp::Trie.

      I'm looking into the other two pieces of advice. I don't suppose sticking a gigabyte of text data in an array is a good idea :P. Well, maybe in chunks ...

        At least for those simplistic patterns, you'll likely do much better if you just do a substring search. The regex engine will also do the substring search for you, but if you don't have www.*.info as a valid pattern, using substr could be faster. Even if you have such patterns, you can possibly move them towards the end of the checks so the rejection can happen earlier.

        Update: I meant index...
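
        A minimal index sketch (the literal tails are made up here; in practice they would come from stripping the leading * off your patterns, and IN/OUT are the usual filehandles):

            my @literals = ( '.perlmonks.org', '.example.com' );
            URL: while ( my $line = <IN> ) {
                my ($url) = $line =~ /(\S+)\s*$/;            # last field
                next URL unless defined $url;
                for my $lit ( @literals ) {
                    next URL if index( $url, $lit ) >= 0;    # substring hit: drop the line
                }
                print OUT $line;
            }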

Re: regexp performance on large logfiles
by perrin (Chancellor) on Aug 05, 2008 at 13:21 UTC
Re: regexp performance on large logfiles
by JavaFan (Canon) on Aug 05, 2008 at 10:38 UTC
    Considering that you seem to be using grep style patterns (which are much simpler than Perl patterns), have you checked whether using grep may be faster? You may even want to compare various implementations of "grep". Then again, if the patterns are simple, Perl might not use the regexp engine at all, and just let the optimizer do all the heavy work.
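
    One quick way to check on your own data is a small Benchmark sketch comparing a fixed-string regexp against index (the sample line and literal are taken from this thread):

        use Benchmark qw( cmpthese );

        my $line = '1217930320 jydawg http://www.perlmonks.org/index.pl';
        cmpthese( -3, {
            regexp => sub { my $hit = $line =~ /\.perlmonks\.org/ },
            index  => sub { my $hit = index( $line, '.perlmonks.org' ) >= 0 },
        } );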

      Nope, in some cases Perl is much faster, and this is one of them. I'm quite fond of ye olde shell scripting, but here Perl is/was the smarter choice.

Re: regexp performance on large logfiles
by ikegami (Patriarch) on Aug 05, 2008 at 17:01 UTC
    • some loop { for my $regexp ( regexps ) { /$regexp/ } }

      is a very bad pattern unless the regexps are precompiled. In your case, the regexps are not precompiled since you're using a hash key (a string) as the regexp.

      In the first program, you could even flatten all of %ignore into a single regexp if you didn't need to do $ignore{$regexp}++.

    • The performance of the regexp engine depends highly on the regexp in question, but you didn't show us the regexps.

    • Since you only care about one value returned by split, a regexp should be faster than split (see the sketch below).
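
    For example, a quick sketch of pulling out just one field without split:

        my ($first) = /^\s*(\S+)/;      # first whitespace-separated field
        my ($last)  = /(\S+)\s*$/;      # last field (the URL column in the OP's data)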

Re: regexp performance on large logfiles
by ikegami (Patriarch) on Aug 05, 2008 at 17:41 UTC

    Some improvements. Notes:

    • %ignore now holds compiled patterns.
    • Counts are now in %ignored.
    • glob2pat should be faster by processing multiple characters at once, but that shouldn't affect your run time.
    • Removed $do_not_print to reduce clutter. A label is used on the while loop instead.
    • In the second snippet, Regexp::Assemble is used to factor out the "^" and "$" that are common to all the ignore regexps.
    • The chomp is probably unnecessary.
    • If you only use $col[0] of @cols, replace
      my( @cols ) = split(" ", $_);
      with
      my( $col0 ) = /^\s*(\S+)/;
    {
        my %patmap = (
            '*' => '.*',
            '?' => '.',
            '[' => '[',
            ']' => ']',
        );
        # character class matching "normal" characters, i.e. anything that is
        # not a glob metacharacter (\Q protects the ']' inside the class)
        my ($norm) = map qr/[^\Q$_\E]/, join '', keys %patmap;

        sub glob2pat {
            my $globstr = @_ ? $_[0] : $_;
            $globstr =~ s{($norm+|.)}
                         { $patmap{$1} || "\Q$1" }sge;
            return '^' . $globstr . '$';
        }
    }

    my %ignore = map { ( $_ => qr/$_/ ) } map glob2pat, @list_regexps;
    my %ignored;

    LINE: while(<FILE>) {
        chomp;
        my( @cols ) = split(" ", $_);
        ...
        for my $re_name ( keys %ignore ) {
            my $re = $ignore{$re_name};
            if ( $cols[0] =~ $re ) {
                $ignored{$re_name}++;
                next LINE;
            }
        }
        ...
    }

    If you can live without %ignored,

    use Regexp::Assemble qw( );

    {
        my %patmap = (
            '*' => '.*',
            '?' => '.',
            '[' => '[',
            ']' => ']',
        );
        my ($norm) = map qr/[^\Q$_\E]/, join '', keys %patmap;

        sub glob2pat {
            my $globstr = @_ ? $_[0] : $_;
            $globstr =~ s{($norm+|.)}
                         { $patmap{$1} || "\Q$1" }sge;
            return '^' . $globstr . '$';
        }
    }

    # build one combined regexp from all the converted globs
    my $ignore_re = do {
        my $ra = Regexp::Assemble->new();
        $ra->add( glob2pat($_) ) for @list_regexps;
        $ra->re();
    };

    while(<FILE>) {
        chomp;
        my( @cols ) = split(" ", $_);
        ...
        next if $cols[0] =~ $ignore_re;
        ...
    }