dbg has asked for the wisdom of the Perl Monks concerning the following question:

Basic history is that we take CSV flatfiles into Excel, run a pivot table on them, and use the results to make neat graphs. As these are firewall logs, you can imagine that the data is varied and numerous. Frequently, however, we wind up with over 65k records in a month, which is past Excel's row limit, so we have to export many CSVs and manually combine the pivot table results. This is very labor intensive, and my thinking is that Perl is the right weapon to tackle it.

The example file in question is approximately 1.7 million records long (about 140 MB), with each row having about 7-8 fields. I'm torn between trying to handle this all in Perl's data structures somehow, or importing it into a database and doing something with it there. I'm also unsure of the best way to actually count the values (this is where perhaps a database query might be useful, and fast?). Some examples of what we'd look for include "top 10 source IPs" or "number of blocked sessions by date". Each record includes a date/time, the IP info typical of a firewall log, attack severity, etc.

Any input here would be most appreciated. The data structure isn't complex, but processing it efficiently and quickly may take some creative thinking. I also have to consider that a database may get rather large, since this is just one file of about 30 I would need to process each month, and hopefully keep around for a little while for historical reasons. Thanks!


Re: Creative sorting and totalling of large flatfiles (aka pivot tables)
by kvale (Monsignor) on Dec 02, 2004 at 00:58 UTC
    In my experience, calculating summary statistics in Perl for a few million records is not terribly time consuming. If the CSV file has no quoting issues, then a simple loop suffices:
        open LOG, '<', $log_file or die "Could not open $log_file: $!\n";
        my %ip_count;
        while (<LOG>) {
            chomp;
            # adjust the field list/order to match your log format
            my ($ip, $severity, $date, @rest) = split /\s*,\s*/;
            $ip_count{$ip}++;
            # other summary stat calcs below
        }
        close LOG;
    For more complex CSV files, try Text::xSV, which handles the full CSV grammar.
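    If you do end up needing Text::xSV, a minimal sketch of the same counting loop might look like the following. This assumes the file has a header row and uses a hypothetical column name, src_ip, for the source address; check the module's docs for the exact interface.

        use Text::xSV;

        my $csv = Text::xSV->new();
        $csv->open_file($log_file);
        $csv->read_header();                 # first row supplies the column names

        my %ip_count;
        while ($csv->get_row()) {
            # 'src_ip' is a guess at the column name - use whatever your header says
            my ($ip) = $csv->extract('src_ip');
            $ip_count{$ip}++;
        }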

    On an Athlon XP 2100 system with a gig of memory, most stats calculations over 1-10 million records typically took 1-10 minutes, so even 30 of these files should only take a few hours. As long as you process one line at a time and have enough RAM to hold your hashes, the calculations should go quickly.
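    Once the counting pass is done, a report like the "top 10 source IPs" the OP mentions falls out of a sort on the hash. A small sketch, reusing the %ip_count hash built above:

        my @top = sort { $ip_count{$b} <=> $ip_count{$a} } keys %ip_count;
        splice @top, 10 if @top > 10;        # keep only the ten busiest sources
        printf "%-15s %d\n", $_, $ip_count{$_} for @top;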

    -Mark

Re: Creative sorting and totalling of large flatfiles (aka pivot tables)
by NetWallah (Canon) on Dec 02, 2004 at 05:31 UTC
    I would strongly advocate a database.

    Typically, you (or management) initially think this is a small one-time thing - all you need are a few summary numbers. What usually happens is that after you see a few nice, somewhat meaningful numbers, you realize what you are NOT seeing, and need to generate more stats. Then meta-stats, then trends, then alarms/thresholds. And it goes on ...

    With a database, you can store periodic info about an IP address: how frequently has it been in the top ten, is its traffic rate increasing over time, which IPs have stopped being used, which subnet generates the most bad traffic, and so on.

    The other advantage is that your data collection and analysis become separate processes. Now you can get close to real-time data analysis. Web interfaces become possible. It's all good (except for the work required to code all this, but even that can be enjoyable).
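    As a rough illustration of that route (a sketch only, not a finished design: it assumes DBD::SQLite is available and invents a table layout with columns log_date, src_ip, action and severity), loading the parsed records and answering the OP's "blocked sessions by date" question could look something like this:

        use strict;
        use warnings;
        use DBI;

        my $log_file = shift @ARGV;
        my $dbh = DBI->connect("dbi:SQLite:dbname=fwlogs.db", "", "",
                               { RaiseError => 1, AutoCommit => 0 });

        # hypothetical table layout - adjust to match the fields in your logs
        $dbh->do(q{CREATE TABLE sessions
                   (log_date TEXT, src_ip TEXT, action TEXT, severity TEXT)});

        my $ins = $dbh->prepare(q{INSERT INTO sessions VALUES (?,?,?,?)});

        open my $log, '<', $log_file or die "Can't open $log_file: $!\n";
        while (<$log>) {
            chomp;
            my ($date, $ip, $action, $severity) = split /\s*,\s*/;
            $ins->execute($date, $ip, $action, $severity);
        }
        close $log;
        $dbh->commit;

        # e.g. "number of blocked sessions by date"
        my $rows = $dbh->selectall_arrayref(q{
            SELECT log_date, COUNT(*) FROM sessions
            WHERE  action = 'blocked'
            GROUP  BY log_date
            ORDER  BY log_date
        });
        printf "%s  %d\n", @$_ for @$rows;

    Once the raw sessions are in a table like this, each new monthly file is just another load, and the summary queries stay the same.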

        ...each is assigned his own private delusion but he cannot see the baggage on his own back.

Re: Creative sorting and totalling of large flatfiles (aka pivot tables)
by pearlie (Sexton) on Dec 02, 2004 at 05:42 UTC
    You can also create graphs from a CSV file using the DBIx::Chart module.
Re: Creative sorting and totalling of large flatfiles (aka pivot tables)
by slife (Scribe) on Dec 02, 2004 at 10:30 UTC