bigbot has asked for the wisdom of the Perl Monks concerning the following question:
I would like to learn the most efficient and/or fastest ways to search through large plain-text files that contain something similar to ASCII tcpdump output of network traffic. The files are often around 1GB in size, and each contains one hour of network traffic. I have no control over how this data is generated or stored, so I have to make the best of searching the plain-text files as they are (I realize that tcpdump raw output or some other binary format would probably be more efficient to search). Each file is in chronological order.
The tool I need to write will take two IP addresses as user input and return all packets from the plain-text file that contain both IPs. It needs to match traffic going both ways, so I cannot assume that either IP entered is the source or the destination; it must check both cases. Here is some sample data:
    2011-01-30 17:21:25.990853 IP 10.10.10.53.2994 > 205.128.64.126.80
    .!)~.....Bb...E..(l8@...lZ
    2011-01-30 17:21:26.056348 IP 10.10.10.53.2994 > 205.128.64.126.80
    GET /j/MSNBC/Components/Photo/_new/110120-durango_tease.thumb.jpg HTTP/1.1
    Accept: */*
    Accept-Encoding: gzip, deflate
    User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; InfoPath.2)
    2011-01-30 17:21:26.078293 IP 205.128.64.126.80 > 10.10.10.53.2994
    ...Bb..!)~....E....L../.....@~
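To make the layout of the sample concrete, here is a minimal Python sketch of how a packet header line could be picked apart. It assumes every packet starts with a `date time IP src.port > dst.port` line exactly as in the sample, with the port appended to the address after a final dot. `HEADER_RE` and `parse_header` are names made up for illustration and are not part of any existing tool.

```python
import re

# Assumed header layout, taken from the sample data:
#   2011-01-30 17:21:25.990853 IP 10.10.10.53.2994 > 205.128.64.126.80
HEADER_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) IP "
    r"(\d+\.\d+\.\d+\.\d+)\.(\d+) > (\d+\.\d+\.\d+\.\d+)\.(\d+)"
)

def parse_header(line):
    """Return (timestamp, src_ip, src_port, dst_ip, dst_port),
    or None when the line is payload rather than a packet header."""
    m = HEADER_RE.match(line)
    if m is None:
        return None
    ts, src, sport, dst, dport = m.groups()
    return ts, src, int(sport), dst, int(dport)
```

Lines that do not parse as a header (the HTTP request text or the raw payload bytes) would belong to the most recent header seen.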
Using Perl exclusively, I have found that the following gives me the best search speed. Going line by line seemed faster than trying to load these giant files into memory. Obviously the tool will be bigger and will keep track of packet state, but this if condition has by far the biggest impact on speed:
    open my $fh, "<", "filename.txt" or die $!;
    while (<$fh>) {
        # Header lines start with the year; \Q...\E quotes the dots in each IP.
        if (/^$year-/ && /\Q$IP1\E/ && /\Q$IP2\E/) {
            print "packet match found!\n";
        }
    }
    close $fh;
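For illustration, the packet-state tracking mentioned above could look roughly like the following. This is a sketch in Python rather than Perl, under stated assumptions and not a measured or definitive solution. It assumes every packet starts with a header line that begins with a date, and that payload lines never do. `matching_packets` is a hypothetical name. Both traffic directions are folded into one regex compiled once outside the loop, and requiring the port after each address prevents `10.10.10.5` from matching `10.10.10.53`.

```python
import re

def matching_packets(lines, ip1, ip2):
    """Yield each packet (header line plus its payload lines) sent
    between ip1 and ip2, in either direction."""
    a, b = re.escape(ip1), re.escape(ip2)
    # One alternation covers both directions; "\.\d+" is the port suffix.
    pair_re = re.compile(
        rf" IP (?:{a}\.\d+ > {b}\.\d+|{b}\.\d+ > {a}\.\d+)"
    )
    # Assumption: only header lines begin with a YYYY-MM-DD date.
    header_re = re.compile(r"^\d{4}-\d{2}-\d{2} ")

    packet, keep = [], False
    for line in lines:
        if header_re.match(line):
            if keep:
                yield packet          # previous packet was a match
            packet = [line]
            keep = pair_re.search(line) is not None
        elif keep:
            packet.append(line)       # payload of a matching packet
    if keep:
        yield packet                  # last packet in the file
```

Passing an open file handle as `lines` keeps the scan line by line, so memory use stays bounded by the largest single packet rather than the size of the file.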
I am looking for a faster way to search if possible, using Perl, Awk, Grep, Python, or even C. I would appreciate any advice on this and on writing the tool in general for speed. This will be used extensively and any performance/efficiency improvements will make a huge impact. Thanks!