in reply to Parsing Large Text Files For Performance

Once a bottleneck has been identified, it's often worth benchmarking a few variations on that little piece of code. In this case the bottleneck is likely to be the code that tests each line for a match, so here's a little benchmark that compares a few ways of doing that test:

#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @lines = <DATA>;
my $ip1 = '10.10.10.53.2994';
my $ip2 = '205.128.64.126.80';

push @lines, @lines for 1 .. 10;

print "useRegex finds ", useRegex ($ip1, $ip2, \@lines), " matches \n";
print "useRegex2 finds ", useRegex2 ($ip1, $ip2, \@lines), " matches \n";
print "useIndex finds ", useIndex ($ip1, $ip2, \@lines), " matches \n";

cmpthese (
    -1,
    {
        useRegex  => sub {useRegex ($ip1, $ip2, \@lines)},
        useRegex2 => sub {useRegex2 ($ip1, $ip2, \@lines)},
        useIndex  => sub {useIndex ($ip1, $ip2, \@lines)},
    },
);

sub useRegex {
    my ($ip1, $ip2, $lines) = @_;
    my $match = qr{$ip1.*$ip2|$ip2.*$ip1};
    my $matches;

    for my $line (@$lines) {
        next if $line !~ $match;
        ++$matches;
    }

    return $matches;
}

sub useRegex2 {
    my ($ip1, $ip2, $lines) = @_;
    my $match1 = qr{$ip1};
    my $match2 = qr{$ip2};
    my $matches;

    for my $line (@$lines) {
        next if $line !~ $match1 || $line !~ $match2;
        ++$matches;
    }

    return $matches;
}

sub useIndex {
    my ($ip1, $ip2, $lines) = @_;
    my $matches;

    for my $line (@$lines) {
        next if 0 > index ($line, $ip1) || 0 > index ($line, $ip2);
        ++$matches;
    }

    return $matches;
}

__DATA__
2011-01-30 17:21:25.990853 IP 10.10.10.53.2994 > 205.128.64.126.80
.!)~.....Bb...E..(l8@...lZ
GET /j/MSNBC/Components/Photo/_new/110120-durango_tease.thumb.jpg HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; InfoPath.2)
IP 205.128.64.126.80
IP 10.10.10.53.2994
2011-01-30 17:21:26.078293 IP 205.128.64.126.80 > 10.10.10.53.2994
...Bb..!)~....E....L../.....@~

Prints:

useRegex finds 2048 matches
useRegex2 finds 2048 matches
useIndex finds 2048 matches
            Rate  useRegex useRegex2  useIndex
useRegex  71.6/s        --       -9%      -76%
useRegex2 78.4/s       10%        --      -74%
useIndex   299/s      318%      281%        --

In this case it looks like index is a pretty clear winner: on this data it runs roughly four times as fast as either regex variant.
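One caveat when comparing the variants: the regex versions interpolate the addresses unescaped, so each '.' matches any character, whereas index does a literal substring search. The following is only a minimal sketch of that difference; the $decoy line is made up for illustration and is not part of the sample data, and the \Q...\E form is offered as one possible way to tighten the regex rather than something the benchmark above uses.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $ip = '10.10.10.53.2994';

# Hypothetical input line (not from the sample data): the digits are right
# but the separators are not dots.
my $decoy = 'IP 10x10y10z53-2994 > elsewhere';

my $loose  = qr{$ip};        # unescaped: each '.' matches any character
my $strict = qr{\Q$ip\E};    # \Q...\E quotes the metacharacters

# Record whether each approach considers the decoy line a match.
my %hit = (
    loose  => ($decoy =~ $loose)         ? 1 : 0,
    strict => ($decoy =~ $strict)        ? 1 : 0,
    index  => (index ($decoy, $ip) >= 0) ? 1 : 0,
);

printf "%-6s %s\n", $_, $hit{$_} ? 'matches' : 'no match'
    for qw(loose strict index);
```

Only the unescaped pattern should report a match here. For log lines like the sample data this is unlikely to change the counts, but it may matter if the regex variants are adapted for less regular input.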

True laziness is hard work