Re: Parsing Large Text Files For Performance

It's often worth benchmarking variations on a small piece of code once a bottleneck has been identified. In this case the bottleneck is likely to be the test that decides whether a line contains both addresses, so here's a little benchmark that compares a few ways of writing that test:

#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @lines = <DATA>;
my $ip1 = '10.10.10.53.2994';
my $ip2 = '205.128.64.126.80';

push @lines, @lines for 1 .. 10;

print "useRegex finds ", useRegex ($ip1, $ip2, \@lines), " matches \n";
print "useRegex2 finds ", useRegex2 ($ip1, $ip2, \@lines), " matches \n";
print "useIndex finds ", useIndex ($ip1, $ip2, \@lines), " matches \n";

cmpthese (-1, {
    useRegex => sub {useRegex ($ip1, $ip2, \@lines)},
    useRegex2 => sub {useRegex2 ($ip1, $ip2, \@lines)},
    useIndex => sub {useIndex ($ip1, $ip2, \@lines)},
    },
    );


sub useRegex {
    my ($ip1, $ip2, $lines) = @_;
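    # Note: the dots in the addresses are regex metacharacters and match any
    # character here; wrapping the variables in \Q...\E would make them literal.
    my $match = qr{$ip1.*$ip2|$ip2.*$ip1};
    my $matches = 0;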

    for my $line (@$lines) {
        next if $line !~ $match;
        ++$matches;
    }

    return $matches;
}


sub useRegex2 {
    my ($ip1, $ip2, $lines) = @_;
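    # Note: the dots in the addresses are unescaped, so each matches any
    # character; \Q$ip1\E and \Q$ip2\E would match them literally.
    my $match1 = qr{$ip1};
    my $match2 = qr{$ip2};
    my $matches = 0;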

    for my $line (@$lines) {
        next if $line !~ $match1 || $line !~ $match2;
        ++$matches;
    }

    return $matches;
}


sub useIndex {
    my ($ip1, $ip2, $lines) = @_;
    my $matches = 0;

    for my $line (@$lines) {
        next if 0 > index ($line, $ip1) || 0 > index ($line, $ip2);
        ++$matches;
    }

    return $matches;
}


__DATA__
2011-01-30 17:21:25.990853 IP 10.10.10.53.2994 > 205.128.64.126.80
.!)~.....Bb...E..(l8@...lZ

GET /j/MSNBC/Components/Photo/_new/110120-durango_tease.thumb.jpg HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; InfoPath.2)
IP 205.128.64.126.80
IP 10.10.10.53.2994
2011-01-30 17:21:26.078293 IP 205.128.64.126.80 > 10.10.10.53.2994
...Bb..!)~....E....L../.....@~

Prints:

useRegex finds 2048 matches 
useRegex2 finds 2048 matches 
useIndex finds 2048 matches 
            Rate  useRegex useRegex2  useIndex
useRegex  71.6/s        --       -9%      -76%
useRegex2 78.4/s       10%        --      -74%
useIndex   299/s      318%      281%        --

In this case it looks like index is a pretty clear winner, running at roughly four times the rate of either regex variant.
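
For a genuinely large file you probably don't want every line in memory the way the benchmark has them, so here is a rough sketch of the same index test applied to a filehandle one line at a time. The countPairs name and the in-memory sample are made up for illustration, and I'm assuming each pair of addresses you care about sits on a single line, as in the sample data above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count the lines read from a filehandle that contain
# both literal strings, in either order.
sub countPairs {
    my ($fh, $ip1, $ip2) = @_;
    my $matches = 0;

    while (my $line = <$fh>) {
        # index returns -1 when the substring is absent
        next if 0 > index ($line, $ip1) || 0 > index ($line, $ip2);
        ++$matches;
    }

    return $matches;
}

# Small in-memory sample standing in for a real capture file.
my $sample = <<'END';
2011-01-30 17:21:25.990853 IP 10.10.10.53.2994 > 205.128.64.126.80
IP 205.128.64.126.80
IP 10.10.10.53.2994
2011-01-30 17:21:26.078293 IP 205.128.64.126.80 > 10.10.10.53.2994
END

open my $fh, '<', \$sample or die "Can't open in-memory file: $!";
print "Found ", countPairs ($fh, '10.10.10.53.2994', '205.128.64.126.80'), " matches\n";
close $fh;
```

To run it against a real file, open the file's path instead of the reference to $sample. Because index compares literal strings, the dots in the addresses need no escaping.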

True laziness is hard work
