Re: virus log parser

I have an anti-virus log file that I would like to eventually put into a mysql database; but I'm having problems parsing it.

This is exactly the class of problems for which Perl was designed. There are many ways to approach this problem, as has already been shown. I'd like to submit my quick and dirty version here. It reads through the log file (really anything on STDIN) and creates an array of hash references, suitable for sorting or iterating through to collect stats like most common email address or virus. This version is short and hopefully transparent.

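#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my (@log, %rec);
while(<>){

  # A line starting with a dash ends the current record.
  if( /^-/ ){
    push @log, { %rec } if keys %rec;   # store a copy; skip empty records
    %rec = ();
    next;
  }

  chomp;
  # Split on the first colon only, so colons in the value survive.
  my ($k, $v) = split /\s*:\s*/, $_, 2;
  $rec{ $k } = $v if $k;
}

# Catch a final record that had no trailing separator.
push @log, { %rec } if keys %rec;

print Dumper(\@log);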

As the input is read line by line, each line is checked to see if it is an "end of record" marker, which is defined here as any line beginning with a dash. If this doesn't match your reality, you will need to tinker with this line. When the end of record is found, the hash that represents that record is stuffed into the @log array. Since arrays can only hold scalar values, a hash reference is needed. Unfortunately, we can't simply push a reference to the hash itself, \%rec, because that hash will be erased on the next line! Instead, we create a brand new anonymous hash with { %rec }, which copies the current contents, and stuff that away. Then we clear out the "global" hash and grab the next line of input.
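To see why the copy matters, here is a small sketch comparing the two ways of pushing. The key and value are made up for illustration and are not taken from any real log.

```perl
use strict;
use warnings;

my (%rec, @by_ref, @by_copy);

# 'virus' and 'Example.A' are hypothetical, for illustration only.
%rec = ( virus => 'Example.A' );
push @by_ref,  \%rec;      # reference to the one and only %rec
push @by_copy, { %rec };   # reference to a fresh anonymous copy

%rec = ();                 # clearing %rec also empties what \%rec points to

my $ref_keys  = keys %{ $by_ref[0] };    # number of keys seen through \%rec
my $copy_keys = keys %{ $by_copy[0] };   # number of keys in the copy
print "by reference: $ref_keys key(s)\n";
print "by copy:      $copy_keys key(s)\n";
```

The record pushed by reference comes back empty, while the copied one still holds its key.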

If the line of input isn't an end of record line, then the newline is removed and the very potent split operator is used to separate the key from the value. This assumes that the key and value are on the same line, of course. As a defensive measure, any whitespace around the colon is consumed. The often neglected third argument to split sets the maximum number of fields split should produce. Even if a colon appears somewhere in the value field, it will still end up as part of the $v variable. After creating a key and a value variable ($k and $v), the record hash is populated with these values provided the key is a true value. This keeps blank lines and lines with an empty key out of your hash. (A line with no colon at all still becomes a key with an undefined value, so add a check on $v if your log has such lines.)
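A quick sketch of what the third argument buys you. The sample line is invented; I'm only assuming that a value might contain colons, as a timestamp would.

```perl
use strict;
use warnings;

# Hypothetical log line; the value itself contains colons.
my $line = "Time : 12:34:56";

my @limited   = split /\s*:\s*/, $line, 2;   # at most two fields
my @unlimited = split /\s*:\s*/, $line;      # as many fields as it finds

print scalar(@limited),   " fields: @limited\n";     # 2 fields: Time 12:34:56
print scalar(@unlimited), " fields: @unlimited\n";   # 4 fields: Time 12 34 56
```

Without the limit, the value gets chopped up and everything past the second field would be thrown away by the ($k, $v) assignment.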

When the loop exits, you might not have pushed the last hash into the @log array (e.g. the last record separator might have gone missing on you). Therefore, a final check pushes %rec into @log if it still has any keys.

I use Data::Dumper merely to show that @log has been populated correctly. If you aren't familiar with Data::Dumper, do make yourself acquainted. It can be a real lifesaver.

I leave the writing of the analysis of @log as an exercise for the reader. If references and dereferences make your head spin, take a look at Mark-Jason Dominus's Understand References Today.
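For anyone who wants a starting point, here is one rough way the analysis could go: counting how often each virus shows up. The @log below is a stand-in, and the 'Virus' and 'From' keys are my guesses; use whatever keys your log actually produces.

```perl
use strict;
use warnings;

# Stand-in for the @log built by the parser. The keys and values here
# are invented for this example.
my @log = (
  { Virus => 'Example.A', From => 'a@example.com' },
  { Virus => 'Example.B', From => 'b@example.com' },
  { Virus => 'Example.A', From => 'c@example.com' },
);

# Tally each virus name, skipping records that lack the field.
my %count;
for my $rec (@log) {
  next unless defined $rec->{Virus};
  $count{ $rec->{Virus} }++;
}

# Most frequent first; ties broken alphabetically.
my @ranked = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
printf "%s: %d\n", $_, $count{$_} for @ranked;
```

Swap 'Virus' for the email address key to get the most common senders the same way.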

Hope this helps.
