Most revered Monks,

I've been given a list of shell regexps and a logfile of a few million lines. The matching lines need to be taken out. Basically something like:

grep -v -f list_regexps.txt logfile.log

The format of the logfile is something like:

number string1 string2

The regexps apply to the last string in the line. I also want to know how many times each regexp matches.

To achieve this I wrote something like this:

# taken from the cookbook:
sub glob2pat {
    my( $globstr ) = @_;
    my %patmap = (
        '*' => '.*',
        '?' => '.',
        '[' => '[',
        ']' => ']',
    );
    $globstr =~ s{(.)} { $patmap{$1} || "\Q$1" }ge;
    return '^' . $globstr . '$';
}
...
my %ignore = map { glob2pat( $_ ) => 0 } @list_regexps;
...
while(<FILE>) {
    chomp;
    my( @cols ) = split(" ", $_);
    my $do_not_print;
    ...
    foreach my $regexp ( keys %ignore ) {
        next if( $do_not_print );
        # $cols[2] is the last string in the line
        if ( $cols[2] =~ m/$regexp/ ) {
            $ignore{$regexp}++;
            $do_not_print++;
        }
    }
    next if( $do_not_print );
}
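For anyone who wants to try this on a small scale, here is a minimal self-contained sketch of the same loop running on in-memory data. The helper name `filter_lines`, the sample patterns and the sample lines are made up for illustration. It also compiles each pattern once with `qr//` and leaves the inner loop at the first match, which is a small variation on the listing above and not a measured improvement.

```perl
use strict;
use warnings;

# Same glob-to-regexp translation as in the listing above.
sub glob2pat {
    my ($globstr) = @_;
    my %patmap = ( '*' => '.*', '?' => '.', '[' => '[', ']' => ']' );
    $globstr =~ s{(.)} { $patmap{$1} || "\Q$1" }ge;
    return '^' . $globstr . '$';
}

# Hypothetical helper: takes a list of globs and a list of log lines,
# returns the lines that were kept and a match count per glob.
sub filter_lines {
    my ( $globs, $lines ) = @_;

    # Compile every pattern once (assumption: not done in the original).
    my @compiled = map { my $pat = glob2pat($_); [ $_, qr/$pat/ ] } @$globs;
    my %count    = map { ( $_ => 0 ) } @$globs;

    my @kept;
    LINE: for my $line (@$lines) {
        my @cols = split ' ', $line;
        next LINE unless @cols;            # skip empty lines
        for my $entry (@compiled) {
            my ( $glob, $re ) = @$entry;
            if ( $cols[-1] =~ $re ) {      # last string in the line
                $count{$glob}++;
                next LINE;                 # first match wins, line is dropped
            }
        }
        push @kept, $line;
    }
    return ( \@kept, \%count );
}

# Made-up sample data in the "number string1 string2" format.
my @globs = ( '*.example.com', 'host??' );
my @lines = (
    '1 GET www.example.com',
    '2 GET host01',
    '3 GET other.org',
    '4 PUT mail.example.com',
);

my ( $kept, $count ) = filter_lines( \@globs, \@lines );
print "$_\n" for @$kept;
printf "%-15s %d\n", $_, $count->{$_} for @globs;
```

Only the line ending in `other.org` survives here. The first glob is counted twice and the second once.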

That first version works but is slow. I found out that a string compare is a lot faster than pattern matching, so I did the Perl equivalent of:

awk '{ print $NF }' logfile.log | sort | uniq -c | sort -nr | head -4000 | awk '{ print $NF }' >temp.dat

Then I applied my list of regexps to that. It looks like this:

use File::Copy;   # for move()

my %data;
my %strcmp;

# get the uniq list
open( DF, "logfile.log" );
while(<DF>) {
    chomp;
    $data{(split(" ",$_))[2]}++;
}
close(DF);

my @keys = sort {$data{$b}<=>$data{$a}} keys %data;
@keys = splice(@keys,0,4000);

open( OUT, ">temp.dat" );
foreach my $line ( @keys ) {
    print OUT "$line\n";
}
close( OUT );

# create the list of strings matching the patterns
open(IN, "$outputfile");
while(<IN>) {
    chomp();
    # match data
    foreach my $regexp ( keys %regexp ) {
        $strcmp{$_} = $regexp{$regexp} if ( $_ =~ m/$regexp/ );
    }
}
close( IN );

# apply this to the logfile
open( IN, "logfile.log" );
open( OUT, ">logfile_parsed.log" );
while(<IN>) {
    chomp;
    my @cols = split(" ",$_);
    next if( exists $strcmp{$cols[2]} );
    print OUT "$_\n";
}
close( OUT );
close( IN );

move( "logfile_parsed.log", "logfile.log" );

# from here more or less the same as the first perl listing.

This significantly sped up the process (30% to 40% less run time), mainly because it removes the most frequent strings up front. Currently a single run takes about 2.5 to 3 hours, and the data size is expected to double in the near future. So I'm looking for any performance gain I can get.
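To make the pre-filter idea easier to experiment with, here is a rough sketch of it on in-memory data. The helper name `prefilter`, its parameters and the sample lines are hypothetical, and the real script works on files and keeps per-regexp counts, which this sketch leaves out.

```perl
use strict;
use warnings;

# Hypothetical helper: $patterns is a list of compiled regexps, $lines the
# log lines, $top_n how many of the most frequent strings to pre-match.
sub prefilter {
    my ( $patterns, $lines, $top_n ) = @_;

    # Step 1: count how often each last-column string occurs.
    my %freq;
    for my $line (@$lines) {
        my @cols = split ' ', $line;
        $freq{ $cols[-1] }++ if @cols;
    }

    # Step 2: run the regexps only on the $top_n most frequent strings.
    # Ties are broken alphabetically so the result is repeatable.
    my @top = sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq;
    splice @top, $top_n if @top > $top_n;

    my %drop;
    for my $str (@top) {
        for my $re (@$patterns) {
            if ( $str =~ $re ) { $drop{$str} = 1; last }
        }
    }

    # Step 3: drop lines with a plain hash lookup, no regexp involved.
    my @rest = grep {
        my @c = split ' ', $_;
        !( @c && $drop{ $c[-1] } );
    } @$lines;

    return \@rest;
}

# Made-up sample data; only the 2 most frequent strings are pre-matched.
my @patterns = ( qr/^.*\.example\.com$/ );
my @log      = (
    '1 a www.example.com',
    '2 b www.example.com',
    '3 c rare.example.com',
    '4 d other.org',
    '5 e other.org',
);

my $rest = prefilter( \@patterns, \@log, 2 );
print "$_\n" for @$rest;
```

In this sample the two `www.example.com` lines are removed by hash lookup alone. The `rare.example.com` line is not among the most frequent strings, so it stays in the output and is left for the slower regexp pass from the first listing.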


In reply to regexp performance on large logfiles by snl_JYDawg
