
I would like to learn the most efficient ways to search through large plain-text files that contain something similar to ASCII tcpdump output of network traffic. Each file holds one hour of traffic and is in many cases around 1GB in size. I have no control over how this data is generated or stored, so I have to make the best of searching the plain text as it is (I realize it would probably be more efficient if the files were in tcpdump raw output or some other binary format). The files are linear by time, with packets in chronological order.

The tool I need to write will take two IP addresses as user input and return all packets from the plain-text file that contain both IPs. It needs to match traffic going both ways, so I cannot assume that either IP is the source or the destination; it must check both cases. Here is some sample data:

2011-01-30 17:21:25.990853 IP 10.10.10.53.2994 > 205.128.64.126.80
.!)~.....Bb...E..(l8@...lZ

2011-01-30 17:21:26.056348 IP 10.10.10.53.2994 > 205.128.64.126.80
GET /j/MSNBC/Components/Photo/_new/110120-durango_tease.thumb.jpg HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; InfoPath.2)

2011-01-30 17:21:26.078293 IP 205.128.64.126.80 > 10.10.10.53.2994
...Bb..!)~....E....L../.....@~
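To make the both-directions requirement concrete, here is a minimal sketch of the check on a single header line. It assumes every header looks like the sample above (date, time, the literal "IP", then address.port > address.port); header_matches is a hypothetical name, not existing code.

```perl
use strict;
use warnings;

# Hypothetical helper: true if a header line involves both IPs, in either
# direction. Assumes the header layout shown in the sample data:
#   DATE TIME IP a.b.c.d.PORT > e.f.g.h.PORT
sub header_matches {
    my ($line, $ip1, $ip2) = @_;
    # Capture source and destination addresses, dropping the trailing port.
    my ($src, $dst) = $line =~
        /^\d{4}-\d\d-\d\d \S+ IP (\d+\.\d+\.\d+\.\d+)\.\d+ > (\d+\.\d+\.\d+\.\d+)\.\d+/
        or return 0;    # not a header line
    return ($src eq $ip1 && $dst eq $ip2) || ($src eq $ip2 && $dst eq $ip1);
}
```

Comparing the captured addresses with eq, rather than searching for the IP as a substring, keeps 10.10.10.5 from matching a line that actually contains 10.10.10.53.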

Using Perl exclusively, I have found that the following gives me the best search speed. Going line by line seemed faster than trying to load these giant files into memory. Obviously the tool will be bigger and keep track of packet state, but this if condition has by far the biggest impact on speed:

open my $fh, "<", "filename.txt" or die $!;
while (<$fh>) {
    # Header lines start with the year; both IPs must appear on the same line.
    if (/^$year\-/ && /\Q $IP1 \E/ && /\Q $IP2 \E/) { print "packet match found!\n"; }
}
close $fh;
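For reference, this is roughly what I mean by keeping track of packet state: a minimal sketch that prints each matching header together with the payload lines that follow it. It assumes the header layout from the sample data, and print_matching_packets is a hypothetical name. I have not benchmarked it; the patterns are compiled once with qr// so the loop body stays small.

```perl
use strict;
use warnings;

# Hypothetical helper: copies every packet (header plus following payload
# lines) that involves both IPs from $in to $out; returns the packet count.
sub print_matching_packets {
    my ($in, $out, $ip1, $ip2) = @_;
    # Assumed header shape: "YYYY-MM-DD HH:MM:SS.frac IP ..."
    my $header = qr/^\d{4}-\d\d-\d\d \d\d:\d\d:\d\d\.\d+ IP /;
    # Each IP must be followed by ".PORT"; the lookahead rejects partial
    # addresses such as 10.10.10.5 inside 10.10.10.53.
    my ($re1, $re2) = map { qr/ \Q$_\E\.\d+(?![.\d])/ } $ip1, $ip2;
    my ($printing, $count) = (0, 0);
    while (my $line = <$in>) {
        if ($line =~ $header) {
            # A new packet starts here: decide once whether to print it.
            $printing = ($line =~ $re1 && $line =~ $re2) ? 1 : 0;
            $count++ if $printing;
        }
        print {$out} $line if $printing;
    }
    return $count;
}
```

It would be called with an opened file handle and an output handle, for example print_matching_packets($fh, \*STDOUT, $IP1, $IP2). Payload lines are only tested against the header pattern, so the two IP patterns run once per packet rather than once per line.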

I am looking for a faster way to search if possible, using Perl, Awk, Grep, Python, or even C. I would appreciate any advice on this and on writing the tool in general for speed. This will be used extensively and any performance/efficiency improvements will make a huge impact. Thanks!

In reply to Parsing Large Text Files For Performance by bigbot
