gulden has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I need your help finding a fast algorithm to process log files: find the pairs of entries and then generate a new log file that joins each pair's information into a single entry.

I have to process log files that are generated every minute, so my processing time has to stay below a minute.

An example of the log files:

FILE_MINUTE_0.log
20090619
    ID = 0
    TIMESTAMP = 1244127600
    COUNT_2 = 10
FILE_MINUTE_1.log
20090619
    ID = 0
    TIMESTAMP = 1244127600
    COUNT_1 = 10
20090619
    ID = 1
    TIMESTAMP = 1244127600
    COUNT_1 = 10
20090619
    ID = 1
    TIMESTAMP = 1244127600
    COUNT_2 = 10
20090619
    ID = 2
    TIMESTAMP = 1244127600
    COUNT_1 = 10
20090619
    ID = 3
    TIMESTAMP = 1244127600
    COUNT_2 = 10
20090619
    ID = 4
    TIMESTAMP = 1244127600
    COUNT_1 = 10
FILE_MINUTE_2.log
20090619
    ID = 4
    TIMESTAMP = 1244127600
    COUNT_2 = 10

I have to process the above files and generate a new one that joins the entries sharing the same ID. This would be the final log file after processing FILE_MINUTE_1.log:

FINAL_FILE_1.log
20090619
    ID = 0
    TIMESTAMP = 1244127600
    TOTAL = COUNT_1 + COUNT_2
20090619
    ID = 1
    TIMESTAMP = 1244127600
    TOTAL = COUNT_1 + COUNT_2
20090619
    ID = 2
    TIMESTAMP = 1244127600
    TOTAL = COUNT_1            --> COUNT_2 does not exist
20090619
    ID = 4
    TIMESTAMP = 1244127600
    TOTAL = COUNT_1 + COUNT_2

Note that the record with ID = 3 does not appear in the file above, since it is a pair whose COUNT_1 entry is missing.
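
For example, with the sample values above (COUNT_1 = 10 and COUNT_2 = 10 for ID 0), the joined entry for ID 0 would presumably come out as:

20090619
    ID = 0
    TIMESTAMP = 1244127600
    TOTAL = 20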

My approaches:

1. Put the log entries into a temporary SQL table, and then generate the output with a query that joins the entries by ID.

2. Put all the log entries into a hash table like this:

$hash = {
    'FILE_MINUTE_2' => {
        1 => { ID => 1, COUNT_1 => 10,    COUNT_2 => 20 },
        2 => { ID => 2, COUNT_1 => 10,    COUNT_2 => undef },
        3 => { ID => 3, COUNT_1 => undef, COUNT_2 => 20 },
        4 => { ID => 4, COUNT_1 => 10,    COUNT_2 => 20 },
    },
};

After reading the log files, I would then only need to process the hash table in order to generate the output file.
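
A minimal sketch of approach 2, assuming the multi-line entry format shown above and placeholder file names; the important part is that a record is only written out when its COUNT_1 is present:

use strict;
use warnings;

my @files = ( 'FILE_MINUTE_0.log', 'FILE_MINUTE_1.log' );   # placeholders

my %records;                                  # joined entries, keyed by ID
foreach my $file (@files) {
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my ( $date, $id );
    while (<$fh>) {
        chomp;
        if (/^(\d+)\s*$/) {                   # a date line starts a new entry
            $date = $1;
        }
        elsif (/^\s+ID\s*=\s*(\S+)/) {
            $id = $1;
            $records{$id}{DATE} = $date;
        }
        elsif (/^\s+TIMESTAMP\s*=\s*(\S+)/) {
            $records{$id}{TIMESTAMP} = $1;
        }
        elsif (/^\s+(COUNT_\d+)\s*=\s*(\S+)/) {
            $records{$id}{$1} += $2;
        }
    }
    close $fh;
}

# Write the joined file, skipping records that never saw a COUNT_1.
open my $out, '>', 'FINAL_FILE_1.log' or die "Cannot open FINAL_FILE_1.log: $!";
for my $id ( sort { $a <=> $b } keys %records ) {
    my $rec = $records{$id};
    next unless defined $rec->{COUNT_1};
    my $total = $rec->{COUNT_1}
              + ( defined $rec->{COUNT_2} ? $rec->{COUNT_2} : 0 );
    print {$out} "$rec->{DATE}\n",
                 "    ID = $id\n",
                 "    TIMESTAMP = $rec->{TIMESTAMP}\n",
                 "    TOTAL = $total\n";
}
close $out;

This sketch keys only on ID; in the real per-minute pipeline the timestamp would probably need to be part of the key as well.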

Any tips or other considerations would be helpful. Thanks.

Re: Process Log Files - Join Log entry pairs
by Utilitarian (Vicar) on Jun 17, 2009 at 12:52 UTC
    while (<$FILE>) {
        if (/^\d/) {
            # new entry: sum the values with those already in $hash{$id}
            # and increment $hash{$id}->count
        }
        elsif (/^\s+ID/) {
            @fields = split /\s+/;
            $id = $fields[3];
        }
        ...
    This type of structure lets you build up a hash as you read each file; you can then sum the values across the files by iterating over its keys (keys %hash).

    Edit: fixed elsif typo Transient pointed out below.

      A few things with that... while (<$FILE>) {
      assumes you have a variable holding a filehandle (a glob). I'm not sure why you'd want to do that here; just a caveat.

      elif should be elsif

      and your code also assumes both a single line read (in the /^\d/ case) and a multi-line read (in the /^\s+ID/ case), unless I'm reading that incorrectly.

      Here's what I came up with:
      use Data::Dumper;
      use strict;

      my $hash  = {};
      my @files = ( "LOG1", "LOG2", "LOG3" );

      foreach my $file ( @files ) {
          open FILE, "<", $file or die "Unable to open $file\n$!\n";
          my ( $date, $id, $timestamp ) = ( 'Unknown', '', '' );
          while (<FILE>) {
              chomp;
              if (/^\d/) {
                  $date = $_;
              } elsif (/^\s+ID\s*=\s*(.*)$/) {
                  $id = $1;
              } elsif (/^\s+TIMESTAMP\s*=\s*(.*)$/) {
                  $timestamp = $1;
              } elsif (/^\s+COUNT_\d+\s*=\s*(.*)$/) {
                  $hash->{$date}->{$id}->{$timestamp} += $1;
              }
          }
          close FILE;
      }
      print Dumper($hash), "\n";
      which would give you
      $VAR1 = {
                '20090619' => {
                                '4' => {
                                         '1244127600' => '20'
                                       },
                                '1' => {
                                         '1244127600' => '20'
                                       },
                                '3' => {
                                         '1244127600' => '10'
                                       },
                                '0' => {
                                         '1244127600' => '20'
                                       },
                                '2' => {
                                         '1244127600' => '10'
                                       }
                              }
              };
      Assumptions are that the log files are all in the order specified, the actual COUNT_X values are immaterial, etc.
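      One caveat with collapsing everything into a single running total: the OP also wants to drop records that never saw a COUNT_1 (the ID = 3 case), which this structure can no longer tell apart. A sketch of one variation, assuming the COUNT branch above is changed to keep the individual counters ($hash->{$date}{$id}{$timestamp}{$1} += $2) and using a hypothetical write_final() helper:

      sub write_final {
          my ( $hash, $out_file ) = @_;
          open my $out, '>', $out_file or die "Unable to open $out_file\n$!\n";
          for my $date ( sort keys %$hash ) {
              for my $id ( sort { $a <=> $b } keys %{ $hash->{$date} } ) {
                  for my $ts ( sort keys %{ $hash->{$date}{$id} } ) {
                      my $counts = $hash->{$date}{$id}{$ts};
                      next unless exists $counts->{COUNT_1};   # skip the ID = 3 case
                      my $total = 0;
                      $total += $_ for values %$counts;
                      print {$out} "$date\n    ID = $id\n",
                                   "    TIMESTAMP = $ts\n    TOTAL = $total\n";
                  }
              }
          }
          close $out;
      }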
Re: Process Log Files - Join Log entry pairs
by gulden (Monk) on Jun 23, 2009 at 21:47 UTC
    Thanks for the tips.

    But if I want to use threads to process multiple files while updating the same hash, how can I do that and still keep the performance up?

    use threads::shared;
    my %hash : shared;

    The above hash declaration doesn't work for nested hashes. Any tips?
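
    A minimal sketch of one way around that, assuming a reasonably recent threads::shared (shared_clone() is a newer addition; on older versions &share() can be used instead): the :shared attribute only shares the top-level %hash, so storing a plain anonymous hash inside it fails with an "Invalid value for shared scalar" error. The inner structures have to be shared too:

    use strict;
    use warnings;
    use threads;
    use threads::shared;

    my %hash : shared;

    # Inner hashes must themselves be shared before they can be stored
    # inside a shared hash; shared_clone() shares a structure recursively.
    $hash{0} = shared_clone( { ID => 0, COUNT_1 => 10, COUNT_2 => 20 } );

    # Workers (one per log file, say) should serialise updates with lock().
    my @workers = map {
        threads->create( sub {
            my $id = shift;
            lock(%hash);
            $hash{$id} = shared_clone( { ID => $id, COUNT_1 => 0, COUNT_2 => 0 } );
        }, $_ );
    } ( 1 .. 3 );
    $_->join for @workers;

    Note that a single shared hash plus locking can serialise the threads and eat the speed-up; an alternative worth measuring is giving each thread its own private hash and merging the results after join().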