Most revered Monks,

I've been given a list of shell-style patterns (globs) and a logfile of a few million lines. Every line that matches one of the patterns needs to be removed. Basically something like:

grep -v -f list_regexps.txt logfile.log

The format of the logfile is something like:

number string1 string2

The regexps apply to the last string in the line. I also want to know how many times each individual regexp matches.

To achieve this I wrote something like this:

# taken from the cookbook:
sub glob2pat {
        my( $globstr ) = @_;
        my %patmap = (
                '*' => '.*',
                '?' => '.',
                '[' => '[',
                ']' => ']',
        );
        $globstr =~ s{(.)} { $patmap{$1} || "\Q$1" }ge;
        return '^' . $globstr . '$';
}
...
my %ignore = map { glob2pat( $_ ) => 0 } @list_regexps;
...
while(<FILE>) {
        chomp;
        my( @cols ) = split(" ", $_);
        my $do_not_print;
...
        foreach my $regexp ( keys %ignore ) {
                # the patterns apply to the last column of the line
                if ( $cols[-1] =~ m/$regexp/ ) {
                        $ignore{$regexp}++;
                        $do_not_print++;
                        last;   # one match is enough, skip the remaining patterns
                }
        }
        next if( $do_not_print );
}
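
A minimal, self-contained sketch of the same filter-and-count idea, for illustration only. It assumes the globs and log lines are already in memory (the sample `@globs` and `@lines` are made up), and it swaps the hash of pattern strings for an array of precompiled `qr//` objects so that each pattern is compiled once rather than on every line. The name `glob2re` is hypothetical; it is the cookbook routine above with `qr//` added.

```perl
use strict;
use warnings;

# Translate a shell glob into an anchored, precompiled regex.
sub glob2re {
    my ($glob) = @_;
    my %patmap = ( '*' => '.*', '?' => '.', '[' => '[', ']' => ']' );
    ( my $pat = $glob ) =~ s{(.)}{ $patmap{$1} || "\Q$1" }ge;
    return qr/^$pat$/;
}

# Made-up sample data standing in for list_regexps.txt and logfile.log.
my @globs = ( '*.gif', 'user?', '/tmp/*' );
my @lines = ( '1 a logo.gif', '2 b user7', '3 c index.html', '4 d pic.gif' );

# Keep each compiled regex next to its source glob and its hit counter.
my @ignore = map { +{ glob => $_, re => glob2re($_), hits => 0 } } @globs;

my @kept;
LINE: for my $line (@lines) {
    my $last = ( split ' ', $line )[-1];    # patterns apply to the last column
    for my $p (@ignore) {
        if ( $last =~ $p->{re} ) {
            $p->{hits}++;
            next LINE;                      # drop the line at the first match
        }
    }
    push @kept, $line;
}

print "$_\n" for @kept;
printf "%-8s %d\n", $_->{glob}, $_->{hits} for @ignore;
```

Using an array instead of `keys %ignore` also makes the pattern order fixed, so when two patterns could match the same string the hit is always credited to the same one.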

This works but is slow. I found out that string comparison is a lot faster than pattern matching, so I did the Perl equivalent of:

awk '{ print $NF }' logfile.log | sort | uniq -c | sort -nr | head -4000 | awk '{ print $NF }' >temp.dat

That gives me the 4000 most frequent last-column strings, and I applied my list of regexps to those. The code looks like:

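use File::Copy;   # provides move(), used at the end of this listing

my %strcmp;       # strings (last column) known to match one of the patterns
my %data;         # last column => number of occurrences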

# get the uniq list
open( DF, "logfile.log" );
while(<DF>) {
        chomp;
        $data{(split(" ",$_))[2]}++;
}
close(DF);

my @keys = sort {$data{$b}<=>$data{$a}} keys %data;
@keys = splice(@keys,0,4000);

open( OUT, ">temp.dat" );
foreach my $line ( @keys ) { print OUT "$line\n"; }
close( OUT );

# create the list of strings matching the patterns
open(IN, "$outputfile");
while(<IN>) {
        chomp();
        # match data
        foreach my $regexp ( keys %regexp ) {
                $strcmp{$_} = $regexp{$regexp} if ( $_ =~ m/$regexp/ );
        }
}
close( IN );

# apply this to the logfile
open( IN, "logfile.log" );
open( OUT, ">logfile_parsed.log" );
while(<IN>) {
        chomp;
        my @cols = split(" ",$_);
        next if( exists $strcmp{$cols[2]} );
        print OUT "$_\n";
}
close( OUT );
close( IN );
move( "logfile_parsed.log", "logfile.log" );

# from here more or less the same as the first perl listing.

This significantly sped up the process (run time went down by 30% to 40%), mainly because it removes the most frequently occurring strings up front. Currently a single run takes about 2.5 to 3 hours, and the data size is expected to double in the near future, so I'm looking for any performance gain I can get.
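
One direction that follows from the observation above, sketched under assumptions rather than measured: instead of pre-selecting the 4000 most frequent strings, remember the verdict for every distinct last-column string in a hash, so the pattern loop runs at most once per distinct string. The sample patterns and lines are made up, `%verdict` and `$regex_runs` are hypothetical names, and the approach assumes the number of distinct strings fits in memory.

```perl
use strict;
use warnings;

# Made-up stand-ins: precompiled patterns and a handful of log lines.
my @patterns = ( qr/^.*\.gif$/, qr/^user.$/ );
my @lines    = (
    '1 a logo.gif', '2 b index.html', '3 c logo.gif',
    '4 d user7',    '5 e logo.gif',
);

my %verdict;                   # last column => index of matching pattern, or -1
my @hits       = (0) x @patterns;
my $regex_runs = 0;            # how often the pattern loop actually ran
my @kept;

for my $line (@lines) {
    my $last = ( split ' ', $line )[-1];

    # Only strings not seen before go through the regexes.
    unless ( exists $verdict{$last} ) {
        $regex_runs++;
        $verdict{$last} = -1;
        for my $i ( 0 .. $#patterns ) {
            if ( $last =~ $patterns[$i] ) {
                $verdict{$last} = $i;
                last;
            }
        }
    }

    # Every later occurrence is a plain hash lookup.
    my $i = $verdict{$last};
    if ( $i >= 0 ) { $hits[$i]++ }
    else           { push @kept, $line }
}

print "$_\n" for @kept;
print "pattern $_: $hits[$_] hits\n" for 0 .. $#patterns;
print "pattern loop ran $regex_runs times for ", scalar(@lines), " lines\n";
```

Whether this helps depends on how repetitive the last column is. If nearly every line carries a unique string, the hash only adds memory and lookup overhead.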

In reply to regexp performance on large logfiles by snl_JYDawg
