Most revered Monks,
I've been given a list of shell glob patterns and a logfile of a few million lines. The matching lines need to be removed. Basically something like:
grep -v -f list_regexps.txt logfile.log
The format of the logfile is something like:
number string1 string2
The regexps apply to the last string on each line. I also want to know how many times each regexp matches.
To achieve this I wrote something like this:
# taken from the Cookbook:
sub glob2pat {
    my ($globstr) = @_;
    my %patmap = ( '*' => '.*', '?' => '.', '[' => '[', ']' => ']' );
    $globstr =~ s{(.)}{ $patmap{$1} || "\Q$1" }ge;
    return '^' . $globstr . '$';
}
...
my %ignore = map { glob2pat($_) => 0 } @list_regexps;
...
while (<FILE>) {
    chomp;
    my @cols = split " ", $_;
    my $do_not_print;
    ...
    foreach my $regexp ( keys %ignore ) {
        if ( $cols[-1] =~ m/$regexp/ ) {    # patterns apply to the last column
            $ignore{$regexp}++;
            $do_not_print++;
            last;
        }
    }
    next if $do_not_print;
}
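One cheap win in a loop like that is to compile each translated glob once with qr// before the read loop, so Perl doesn't have to reconsider the pattern string on every line, and to stop at the first hit. A minimal sketch; the pattern list and input lines here are made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# glob2pat as in the Perl Cookbook
sub glob2pat {
    my ($globstr) = @_;
    my %patmap = ( '*' => '.*', '?' => '.', '[' => '[', ']' => ']' );
    $globstr =~ s{(.)}{ $patmap{$1} || "\Q$1" }ge;
    return '^' . $globstr . '$';
}

# Hypothetical pattern list standing in for the real @list_regexps.
my @list_regexps = ( 'sendmail*', 'cron?' );

# Compile once, up front; count hits per original pattern.
my @compiled = map { my $p = glob2pat($_); qr/$p/ } @list_regexps;
my %count;

# Placeholder lines in the "number string1 string2" format.
my @kept;
for my $line ( '1 host sendmail_foo', '2 host cron1', '3 host sshd' ) {
    my @cols = split " ", $line;
    my $hit;
    for my $i ( 0 .. $#compiled ) {
        if ( $cols[-1] =~ $compiled[$i] ) {
            $count{ $list_regexps[$i] }++;
            $hit = 1;
            last;    # first match is enough to drop the line
        }
    }
    push @kept, $line unless $hit;
}
```

With precompiled patterns the per-line cost is just the match itself, not pattern interpolation plus match.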
This works, but it is slow. I found out that string comparison is a lot faster than pattern matching, so I did the Perl equivalent of:
awk '{ print $NF }' logfile.log | sort | uniq -c | sort -nr | head -4000 | awk '{ print $NF }' >temp.dat
and applied my list of regexps to that, which looks like:
my %strcmp;

# get the uniq list
open( DF, "logfile.log" );
while (<DF>) {
    chomp;
    $data{ (split " ", $_)[2] }++;
}
close(DF);

my @keys = sort { $data{$b} <=> $data{$a} } keys %data;
@keys = splice( @keys, 0, 4000 );

open( OUT, ">temp.dat" );
foreach my $line (@keys) {
    print OUT "$line\n";
}
close(OUT);

# create the list of strings matching the patterns
open( IN, "temp.dat" );
while (<IN>) {
    chomp;
    # match data
    foreach my $regexp ( keys %regexp ) {
        $strcmp{$_} = $regexp{$regexp} if $_ =~ m/$regexp/;
    }
}
close(IN);

# apply this to the logfile
open( IN, "logfile.log" );
open( OUT, ">logfile_parsed.log" );
while (<IN>) {
    chomp;
    my @cols = split " ", $_;
    next if exists $strcmp{ $cols[2] };
    print OUT "$_\n";
}
close(OUT);
close(IN);
move( "logfile_parsed.log", "logfile.log" );    # needs File::Copy
# from here more or less the same as the first Perl listing
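Another angle worth trying: join all the translated globs into one big alternation and compile it once. The combined regex then gives a single cheap yes/no answer per line, and only the lines that do match (the ones being dropped anyway) pay for the per-pattern loop that keeps the counts. A sketch with made-up patterns, not the real list:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical already-translated patterns (i.e. output of glob2pat).
my @pats = ( '^sendmail.*$', '^cron.$' );

# One combined, precompiled alternation for the fast reject test.
my $any    = join '|', map { "(?:$_)" } @pats;
my $any_re = qr/$any/;
my @compiled = map { qr/$_/ } @pats;

my %count;
my @kept;
for my $line ( '1 host sendmail_foo', '2 host cron1', '3 host sshd' ) {
    my @cols = split " ", $line;
    if ( $cols[-1] =~ $any_re ) {
        # Line is dropped either way; only now find which pattern hit.
        for my $i ( 0 .. $#compiled ) {
            if ( $cols[-1] =~ $compiled[$i] ) {
                $count{ $pats[$i] }++;
                last;
            }
        }
        next;
    }
    push @kept, $line;
}
```

Since most lines in a typical logfile survive the filter, they take exactly one regex test instead of one per pattern.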
This significantly sped up the process (a 30 to 40% reduction in run time), mainly because it handles the highest-scoring strings with cheap hash lookups. A single run currently takes about 2.5 to 3 hours, and the data size is expected to double in the near future, so I'm looking for any performance gain I can get.
In reply to regexp performance on large logfiles by snl_JYDawg