
I'm not sure what you were planning with the matrices: if you want to work further with this data, or move it into a database, you're probably best off pulling it into a hash, or an array-of-hashes.

If the file is very large, or memory is limited, you may have to read the file line by line, as others have suggested, insert each completed record into the database and then use that to perform whatever analyses made you want to put them there in the first place.

If you're more interested in a quick scan - how much klez this week? - then an AoH will be more fun. You should probably still read the file a line at a time, though. It might be more dashing to do an enormous split on -+, but not wise, especially if you reset $/ to do it. I really wouldn't do that: a little too sweeping.
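Here's a rough sketch of that quick scan, just to show the shape of the thing. It reads the sample records from an in-memory string (you'd open the real log instead), and the names @records, %current and %count are mine, not anything from your script. It counts every entry it sees; limiting it to this week would need a test on the date field, which I've left out.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# sample log text; in real use, open the log file instead
my $log = <<'END';
----------------------------------
Date: 06/30/2002 00:01:21
From: pminich@foo.com
To: esquared@foofoo.com
File: value.scr
Action: The uncleanable file is deleted.
Virus: WORM_KLEZ.H
----------------------------------
Date: 06/30/2002 00:01:21
From: mef@mememe.com
To: inet@microsoft.com
File: Nr.pif
Action: The uncleanable file is deleted.
Virus: WORM_KLEZ.H
----------------------------------
END

open(my $fh, '<', \$log) or die "can't open in-memory log: $!";

my @records;    # the AoH: one hashref per log entry
my %current;    # fields of the entry currently being read

while (my $line = <$fh>) {
    if ($line =~ /^(\w+):\s*(.+?)\s*$/) {
        # a "Name: value" line: remember it under a lowercased key
        $current{lc $1} = $2;
    }
    elsif ($line =~ /^-+\s*$/ && %current) {
        # a dividing line closes the entry: store a copy and start afresh
        push @records, { %current };
        %current = ();
    }
}
close $fh;

# the quick scan: how many entries per virus?
my %count;
$count{ $_->{virus} }++ for @records;

printf "%-15s %d\n", $_, $count{$_} for sort keys %count;
```

Once everything is in @records, most other questions are a grep or a map away, e.g. `grep { $_->{file} =~ /\.pif$/i } @records`.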

If there were a unique identifier with each record, then a HoH would be more useful: a big hash in which the keys come from your unique field and each value is another hash containing the foo=bar pairs you've extracted. The main advantage would be that you share a key with the original file, allowing (for example) incremental updates of the database.
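To make the HoH idea concrete, here's a minimal sketch. Be warned that the "Id:" line is invented purely for illustration (nothing like it appears in the log), and %already_stored just stands in for whatever list of keys you'd fetch from the database before an incremental update.

```perl
use strict;
use warnings;

# ASSUMPTION: each record carries a unique "Id:" line. The log in this
# thread has no such field; it is made up here to show the structure.
my @lines = (
    'Id: 1001', 'Virus: WORM_KLEZ.H', 'File: value.scr', '-----',
    'Id: 1002', 'Virus: WORM_KLEZ.H', 'File: Nr.pif',    '-----',
);

# hypothetical: ids that a previous run has already put in the database
my %already_stored = ( 1001 => 1 );

my (%log, %current);
for my $line (@lines) {
    if ($line =~ /^(\w+):\s*(.+?)\s*$/) {
        $current{lc $1} = $2;
    }
    elsif ($line =~ /^-+\s*$/ && %current) {
        # the id becomes the outer key; the rest becomes the inner hash
        my $id = delete $current{id};
        die "record without an id" unless defined $id;
        $log{$id} = { %current };
        %current = ();
    }
}

# incremental update: only ids the database hasn't seen need inserting
my @new_ids = grep { !$already_stored{$_} } sort keys %log;
print "new: @new_ids\n";
```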

But there doesn't seem to be a useful hook like that, unless perhaps the events are rare enough that you don't mind assuming the timestamp for each entry is unique. So everything would go in an array instead, and the array index could serve as a makeshift id. You could still use the dates to act on only part of the file, or just invoke your script from logrotate.

I'll assume that you're putting everything in a database first and then working with it later. This is pretty hasty, but tested, and I've tried to keep it readable:

#!/usr/bin/perl

use strict;
use warnings;
use DBI;
use Data::Dumper;

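# decide which bits of the records you want to keep. note that the
# sample data below has no Name: line, so with this list the name
# column is inserted as NULL and the From: lines are skipped

my @fields_to_store = qw(date name to file action virus);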

# turn that into a hash with which to screen regex matches

my %field_ok = map { $_ => 1 } @fields_to_store;

# and two strings for the database insert statement: one of column 
# names, one with the proper number of placeholders.

my $field_list = join(',', @fields_to_store);
my $placeholders = join(',', ('?') x scalar(@fields_to_store));

# connect to the database 

my $dsn = "DBI:mysql:database=xxxx;host=localhost";
my $dbh = DBI->connect($dsn, 'xxxx', 'xxxx', {
    'RaiseError' => 1
});

# build the instruction that will be used to insert each record 

my $insert_handle = $dbh->prepare("insert into xxxx ($field_list) values ($placeholders)");

# read the file. this %gather basket is crude, but effective 
# enough, so i offer it in the spirit of tmtowtdi

my %gather;
while(<DATA>) {

 # match data line?

    if (m/^(\w+):\s*(.+?)\s*$/ && $field_ok{lc $1}) {
        die "overwriting $1 field: broken" if exists $gather{lc $1};
        $gather{lc $1} = $2;
    }

# match dividing line?

    if (m/^-+\s*$/ && keys %gather) {

        # field order matters, of course, so use the fields_to_store array
        # in a map{} to order the contents of %gather, which would
        # otherwise be jumbled

        $insert_handle->execute( map { $gather{lc $_} } @fields_to_store );
        print Dumper \%gather; 
        %gather = ();
    }
}

$insert_handle->finish;

__DATA__
----------------------------------
Date: 06/30/2002 00:01:21
From: pminich@foo.com
To: esquared@foofoo.com
File: value.scr
Action: The uncleanable file is deleted.
Virus: WORM_KLEZ.H
----------------------------------
Date: 06/30/2002 00:01:21
From: mef@mememe.com
To: inet@microsoft.com
File: Nr.pif
Action: The uncleanable file is deleted.
Virus: WORM_KLEZ.H
----------------------------------

For your database to be of much use you'd really need to split out the email addresses and store them in a separate table, with another table in between that and the main one to hold the links between log entries and addresses. By that stage it would already be worth looking for something like Class::DBI to do the drudgery for you.
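If it helps to picture that three-table layout before touching the database, here's a small in-memory sketch. %address_id plays the part of the address table and @links the linking table; both names, the numbering scheme and the from/to role column are my own assumptions, not a schema anyone has agreed on.

```perl
use strict;
use warnings;

# two entries in the shape the parser produces (values from the sample log)
my @entries = (
    { from => 'pminich@foo.com', to => 'esquared@foofoo.com', virus => 'WORM_KLEZ.H' },
    { from => 'mef@mememe.com',  to => 'inet@microsoft.com',  virus => 'WORM_KLEZ.H' },
);

my %address_id;   # stands in for the address table: email => id
my @links;        # stands in for the link table: [entry id, address id, role]

for my $entry_id (0 .. $#entries) {
    for my $role (qw(from to)) {
        my $email = lc $entries[$entry_id]{$role};

        # hand out the next id the first time an address turns up
        unless (exists $address_id{$email}) {
            my $next_id = 1 + keys %address_id;
            $address_id{$email} = $next_id;
        }

        push @links, [ $entry_id, $address_id{$email}, $role ];
    }
}

print join(' ', @$_), "\n" for @links;
```

An address that appears in several entries gets one row in the address table and one link per appearance, which is what makes questions like "who keeps sending us klez?" cheap to answer.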

In reply to Re: virus log parser by thpfft
in thread virus log parser by phaedo