bigbot has asked for the wisdom of the Perl Monks concerning the following question:

I would like to learn about the most efficient and/or fastest ways to search through large plain-text files, which contain something similar to ASCII tcpdump output of network traffic. The files are often around 1GB in size and each contains one hour of network traffic. I have no control over how this data is generated or stored, so I have to make the best of searching the plain-text files (I realize raw tcpdump output or some other binary format would probably be more efficient to search, but again, that is out of my hands). The files are therefore ordered linearly by time.

The tool I need to write will take a user input of two IP addresses and return all packets from the plain-text file that contain both IPs. The tool needs to match traffic going both ways, so I cannot assume that either IP entered is the source or the destination; it must check both cases. Here is some sample data:

    2011-01-30 17:21:25.990853 IP 10.10.10.53.2994 > 205.128.64.126.80
    .!)~.....Bb...E..(l8@...lZ
    2011-01-30 17:21:26.056348 IP 10.10.10.53.2994 > 205.128.64.126.80
    GET /j/MSNBC/Components/Photo/_new/110120-durango_tease.thumb.jpg HTTP/1.1
    Accept: */*
    Accept-Encoding: gzip, deflate
    User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; InfoPath.2)
    2011-01-30 17:21:26.078293 IP 205.128.64.126.80 > 10.10.10.53.2994
    ...Bb..!)~....E....L../.....@~

Using Perl exclusively, I have found that the following gives me the best search speed. Going line by line seemed faster than trying to load these giant files into memory. Obviously the real tool will be bigger and will keep track of packet state, but this if statement has by far the biggest impact on speed:

    open FILE, "<", "filename.txt" or die $!;
    while (<FILE>) {
        if (($_ =~ /^$year\-/) && ($_ =~ /\Q $IP1 \E/) && ($_ =~ /\Q $IP2 \E/)) {
            print "packet match found!";
        }
    }

I am looking for a faster way to search if possible, using Perl, Awk, Grep, Python, or even C. I would appreciate any advice on this and on writing the tool in general for speed. This will be used extensively and any performance/efficiency improvements will make a huge impact. Thanks!

Re: Parsing Large Text Files For Performance
by dave_the_m (Monsignor) on Jan 31, 2011 at 00:21 UTC
    First, try just a null reading loop in Perl - this will give you a lower bound on how fast any Perl solution can possibly go:
    while (<FILE>) { }
    If this isn't fast enough, then you'll definitely have to use something other than Perl, at least for the initial filtering. If it is fast enough, then the next thing is to reject unwanted lines as quickly as possible. You probably want to do that with a single pre-compiled regexp:

    my $re = qr/^$year-.{21} IP (?:\Q$IP1\E > \Q$IP2\E|\Q$IP2\E > \Q$IP1\E)/;
    while (<FILE>) {
        next unless /$re/;
        ... do something ...
    }

    The exact form of the pattern will depend on how flexible the 1st line of each entry is, e.g. whether it uses a fixed number of digits for the sub-second value.

    Dave.

Re: Parsing Large Text Files For Performance
by GrandFather (Saint) on Jan 31, 2011 at 01:24 UTC

    Often it's worth benchmarking variations on a small piece of code once a bottleneck has been identified. In this case the bottleneck is likely to be the test code, so here's a little benchmark that tests a few variations on that task:

    #!/usr/bin/perl
    use strict;
    use warnings;

    use Benchmark qw(cmpthese);

    my @lines = <DATA>;
    my $ip1 = '10.10.10.53.2994';
    my $ip2 = '205.128.64.126.80';

    push @lines, @lines for 1 .. 10;

    print "useRegex finds ", useRegex ($ip1, $ip2, \@lines), " matches \n";
    print "useRegex2 finds ", useRegex2 ($ip1, $ip2, \@lines), " matches \n";
    print "useIndex finds ", useIndex ($ip1, $ip2, \@lines), " matches \n";

    cmpthese (-1,
        {
            useRegex  => sub {useRegex ($ip1, $ip2, \@lines)},
            useRegex2 => sub {useRegex2 ($ip1, $ip2, \@lines)},
            useIndex  => sub {useIndex ($ip1, $ip2, \@lines)},
        },
    );

    sub useRegex {
        my ($ip1, $ip2, $lines) = @_;
        my $match = qr{$ip1.*$ip2|$ip2.*$ip1};
        my $matches;

        for my $line (@$lines) {
            next if $line !~ $match;
            ++$matches;
        }

        return $matches;
    }

    sub useRegex2 {
        my ($ip1, $ip2, $lines) = @_;
        my $match1 = qr{$ip1};
        my $match2 = qr{$ip2};
        my $matches;

        for my $line (@$lines) {
            next if $line !~ $match1 || $line !~ $match2;
            ++$matches;
        }

        return $matches;
    }

    sub useIndex {
        my ($ip1, $ip2, $lines) = @_;
        my $matches;

        for my $line (@$lines) {
            next if 0 > index ($line, $ip1) || 0 > index ($line, $ip2);
            ++$matches;
        }

        return $matches;
    }

    __DATA__
    2011-01-30 17:21:25.990853 IP 10.10.10.53.2994 > 205.128.64.126.80
    .!)~.....Bb...E..(l8@...lZ
    GET /j/MSNBC/Components/Photo/_new/110120-durango_tease.thumb.jpg HTTP/1.1
    Accept: */*
    Accept-Encoding: gzip, deflate
    User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; InfoPath.2)
    IP 205.128.64.126.80
    IP 10.10.10.53.2994
    2011-01-30 17:21:26.078293 IP 205.128.64.126.80 > 10.10.10.53.2994
    ...Bb..!)~....E....L../.....@~

    Prints:

    useRegex finds 2048 matches
    useRegex2 finds 2048 matches
    useIndex finds 2048 matches
                Rate  useRegex useRegex2  useIndex
    useRegex  71.6/s        --       -9%      -76%
    useRegex2 78.4/s       10%        --      -74%
    useIndex   299/s      318%      281%        --

    In this case it looks like index is a pretty clear winner.

    True laziness is hard work
Re: Parsing Large Text Files For Performance
by davido (Cardinal) on Jan 31, 2011 at 00:55 UTC

    dave_the_m's advice is spot on. I did have one thought, though. If IP-1 and IP-2 are always presented in the same order (any line containing both IPs always presents the same one first), you could set the special variable $/ to the string of IP-2. That way, instead of reading lines you'll be reading records ending with the 2nd critical IP address. Then all you have to do is scan the record to see whether the 1st critical IP address appears after the nearest preceding newline character. If so, you've got a match.

    Why would this be theoretically advantageous? It may (depending on how often IP-2 shows up) result in fewer iterations through the while loop. You're still reading the whole file, but only doing a regexp check if you already know that half of the condition has been met.

    When reading a file there is an implicit check happening; behind the scenes, Perl looks for $/ to end each record. You may as well use that implicit 'check' to your advantage.

    Of course, this adds complexity if you also need to capture info that comes after that 2nd IP address in the file. At that point it would be hard to guess whether the extra logic needed to handle that would negate whatever minor advantage this approach has in the first place. I guess that means YMMV (your mileage may vary).
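
    A rough, untested sketch of that idea (purely my own reading of it; the file name and the two IPs are placeholders) might look like this:

        #!/usr/bin/perl
        use strict;
        use warnings;

        my ($IP1, $IP2) = ('10.10.10.53.2994', '205.128.64.126.80');

        open my $fh, '<', 'filename.txt' or die $!;
        local $/ = $IP2;    # each "record" now ends at an occurrence of IP-2

        while (my $record = <$fh>) {
            # Skip the leftover text after the last IP-2 occurrence (end of file).
            next unless substr($record, -length $IP2) eq $IP2;

            # Only the text after the record's last newline belongs to the line
            # containing IP-2; if IP-1 is in that stretch, both IPs share a line.
            my $line_tail = substr $record, rindex($record, "\n") + 1;
            print "packet match found!\n" if index($line_tail, $IP1) >= 0;
        }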


    Dave

      Thanks so much for the timely advice! I will do some testing and get back to you with the results.
Re: Parsing Large Text Files For Performance
by jethro (Monsignor) on Jan 31, 2011 at 01:26 UTC

    On really big files your speed should be bounded by how fast you can read from disk. If that is the case, C or anything else would only give you lower CPU utilization, not more speed.

    Nevertheless, I suspect the fastest search will be with grep. You might want to compare your solution to a simple "grep IP1 | grep IP2 | grep year" on the command line. If IP1 isn't too common, the following two greps won't be a relevant factor in the total running time. And grep is highly optimised C code.

Re: Parsing Large Text Files For Performance
by BrowserUk (Patriarch) on Jan 31, 2011 at 02:07 UTC

    As you require each line to match 3 separate criteria, you are effectively searching each (matching) line 3 times.

    If you have multiple cores, then you may be able to gain some performance by overlapping the searches, using a separate process for each:

    grep '2010' file | grep '123.456.0.1' | grep '456.321.0.1'

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Parsing Large Text Files For Performance
by hsinclai (Deacon) on Jan 31, 2011 at 03:33 UTC
    As an option to reading line by line from a file handle, the technique in Re: Muy Large File might be adapted... sysread, sysseek, and syswrite are builtins. IIRC the original goal of that OP was in-place replacement of strings, but you might be able to locate the IPs, and then the start and end of lines (packets), by the same method. I saved this one because it's so fast for reading and ripping through files off disk.
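
    Not the code from that node, but a rough, untested sketch of the general approach (the block size, file name, and IPs are placeholders): read large fixed-size chunks with sysread, split them on newlines yourself, and carry over the trailing partial line so nothing is lost at chunk boundaries:

        use strict;
        use warnings;

        my ($file, $ip1, $ip2) = ('filename.txt', '10.10.10.53.2994', '205.128.64.126.80');
        open my $fh, '<', $file or die "Can't open $file: $!";

        my $tail = '';
        while (sysread $fh, my $block, 1024 * 1024) {
            my $buf = $tail . $block;

            # Work only on complete lines; keep the trailing partial line for next time.
            my $last_nl = rindex $buf, "\n";
            if ($last_nl < 0) { $tail = $buf; next; }
            $tail = substr $buf, $last_nl + 1;
            $buf  = substr $buf, 0, $last_nl + 1;

            for my $line (split /\n/, $buf) {
                next if 0 > index($line, $ip1) || 0 > index($line, $ip2);
                print "$line\n";
            }
        }

        # Don't forget a possible final line with no trailing newline.
        print "$tail\n" if length $tail && 0 <= index($tail, $ip1) && 0 <= index($tail, $ip2);
        close $fh;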
Re: Parsing Large Text Files For Performance
by bigbot (Beadle) on Jan 31, 2011 at 06:41 UTC
    Here's what I've come up with so far (using some of the improvements suggested and benchmarking using GrandFather's example):
    my $hdr_search = qr{^\-{4}\s\d};
    my $header = 0;    ## True while we are inside a matching packet

    sub useIndex {
        my ($ip1, $ip2, $lines) = @_;

        for my $line (@$lines) {
            unless (($line =~ $hdr_search) || ($header)) {
                next;
            }
            unless ($line =~ $hdr_search) {
                print $line;        ## Print matching packet body
                next;
            }
            if (0 > index ($line, $ip1) || (0 > index ($line, $ip2))) {
                $header = 0;
                next;
            }
            else {
                print "\n$line";    ## Print matching packet header
                $header = 1;
            }
        }
    }
    Using the index function, along with rejecting non-matching lines as early as possible, has definitely sped things up considerably. Thoughts?
Re: Parsing Large Text Files For Performance
by roboticus (Chancellor) on Jan 31, 2011 at 10:57 UTC

    bigbot:

    You've already received some great answers. However, they all assume that you're searching a particular log file only once. On the off chance you need to search repeatedly through the log file(s), though, you may want to alter your strategy. You might:

    • Digest the log file into a more convenient form for your searches.
    • Build an index of the first and last time each IP is used (see the sketch after this list).
    • Dump the information into a database.
    • Split the file into neatly formatted chunks based on IP ranges.
    • Perform multiple searches in the same pass.
    • ...other optimizations left as an exercise for the reader...
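
    A rough, untested sketch of one way to read the second bullet (my own illustration, not roboticus's code; it records byte offsets rather than timestamps, and the file name and header-line format are assumptions): one pass notes, for each IP seen on a header line, where it first and last appears, so later searches can seek straight to the interesting region instead of scanning the whole file:

        use strict;
        use warnings;

        my %index;    # ip => [ first_offset, last_offset ]
        open my $fh, '<', 'filename.txt' or die $!;

        while (1) {
            my $offset = tell $fh;
            my $line   = <$fh>;
            last unless defined $line;

            # Assumes header lines look like the sample data:
            # "2011-01-30 17:21:25.990853 IP <src> > <dst>"
            next unless $line =~ /^\d{4}-\d\d-\d\d .* IP (\S+) > (\S+)/;
            for my $ip ($1, $2) {
                $index{$ip} //= [ $offset, $offset ];
                $index{$ip}[1] = $offset;
            }
        }

        # A later search for one IP can then seek() to $index{$ip}[0]
        # and stop once it has read past $index{$ip}[1].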

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: Parsing Large Text Files For Performance
by bigbot (Beadle) on Jan 31, 2011 at 02:34 UTC
    Thanks everyone! I will do some testing and let you know how it goes.
Re: Parsing Large Text Files For Performance
by educated_foo (Vicar) on Jan 31, 2011 at 16:33 UTC
    You might look at the "wide finder" benchmark from a few years ago for ideas of how to do this quickly.
Re: Parsing Large Text Files For Performance
by bigbot (Beadle) on Jan 31, 2011 at 14:45 UTC

    I made some more changes and found that I could increase speed by doing a fixed-string grep first (thanks jethro). What I did was grep for one of the IPs and use grep's after-context option (-A 50) to display the 50 lines below the packet header. Then I processed that output using the script I posted above.

    The initial improvement in speed is about 60x over what I was doing before. This is great!

Re: Parsing Large Text Files For Performance
by bigbot (Beadle) on Feb 01, 2011 at 00:03 UTC
    One thing I'm curious about: do you think there is a more efficient way to read in the file, instead of doing it line by line? Perhaps a way to read in one packet at a time? The files are too large for slurping, but is line by line the most efficient approach?

      In general, reading a file in fixed-size blocks is somewhat faster than reading it line by line, especially if the block size is chosen to coincide with the 'natural' read size of the filesystem on the device holding the file.

      This is easily explained by the fact that when reading line by line, the runtime first reads a block and then has to scan that block looking for the end of line character before transferring the appropriate number of bytes to another buffer for return to the calling program.

      But for your application, where you want the individual lines that contain your matching terms, if you read the file block-wise you'd then have to break each block up by searching for the newlines on either side of the matches anyway, so the net result would be the same amount of work. And searching for line ends in Perl will usually be slower than letting the system do it in C.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      Write a benchmark and test it! There are too many variables that may come into play for us to give a really useful answer without actually trying it on hardware equivalent to the system you are using.
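
      For instance, a minimal starting point for such a benchmark (the block size and file name are placeholders, and a real test should include the matching work as well as the raw reads) might be:

          use strict;
          use warnings;
          use Benchmark qw(cmpthese);

          my $file = 'filename.txt';

          cmpthese (5,
              {
                  by_line  => sub {
                      open my $fh, '<', $file or die $!;
                      while (<$fh>) { }
                  },
                  by_block => sub {
                      open my $fh, '<', $file or die $!;
                      1 while read $fh, my $buf, 64 * 1024;
                  },
              },
          );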

      True laziness is hard work
        I will do that, thanks!