gulden has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I need your help finding a fast algorithm to process log files: find the pairs of entries and then generate a new log file that joins each pair's information into a single entry.

I have to process log files that are generated every minute, so my processing time has to stay below a minute.

An example of the log files:

FILE_MINUTE_0.log
20090619
    ID = 0
    TIMESTAMP = 1244127600
    COUNT_2 = 10
FILE_MINUTE_1.log
20090619
    ID = 0
    TIMESTAMP = 1244127600
    COUNT_1 = 10
20090619
    ID = 1
    TIMESTAMP = 1244127600
    COUNT_1 = 10
20090619
    ID = 1
    TIMESTAMP = 1244127600
    COUNT_2 = 10
20090619
    ID = 2
    TIMESTAMP = 1244127600
    COUNT_1 = 10
20090619
    ID = 3
    TIMESTAMP = 1244127600
    COUNT_2 = 10
20090619
    ID = 4
    TIMESTAMP = 1244127600
    COUNT_1 = 10
FILE_MINUTE_2.log
20090619
    ID = 4
    TIMESTAMP = 1244127600
    COUNT_2 = 10

I have to process the above files and generate a new one that joins the entries sharing the same ID. This would be the final log file after processing FILE_MINUTE_1.log:

FINAL_FILE_1.log
20090619
    ID = 0
    TIMESTAMP = 1244127600
    TOTAL = COUNT_1 + COUNT_2
20090619
    ID = 1
    TIMESTAMP = 1244127600
    TOTAL = COUNT_1 + COUNT_2
20090619
    ID = 2
    TIMESTAMP = 1244127600
    TOTAL = COUNT_1            --> COUNT_2 does not exist
20090619
    ID = 4
    TIMESTAMP = 1244127600
    TOTAL = COUNT_1 + COUNT_2

Note that the record with ID = 3 does not appear in the file above, since it is a pair whose COUNT_1 entry is missing.
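
For example, with the sample values above (COUNT_1 = 10 and COUNT_2 = 10 for ID 0), the joined entry for ID 0 would presumably come out as:

20090619
    ID = 0
    TIMESTAMP = 1244127600
    TOTAL = 20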

My approaches:

1. Put the log entries into a temporary SQL table, and then generate the output with a query that joins the entries by ID.

2. Put all the log entries into a hash table like this:

$hash = {
    'FILE_MINUTE_2' => {
        1 => { ID => 1, COUNT_1 => 10,    COUNT_2 => 20 },
        2 => { ID => 2, COUNT_1 => 10,    COUNT_2 => undef },
        3 => { ID => 3, COUNT_1 => undef, COUNT_2 => 20 },
        4 => { ID => 4, COUNT_1 => 10,    COUNT_2 => 20 },
    },
};

After reading the log files, I would then only need to process the hash table in order to generate the output file.
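
A minimal sketch of approach 2, assuming the multi-line entry format shown above and placeholder file names; the important part is that a record is only written out when its COUNT_1 is present:

use strict;
use warnings;

my @files = ( 'FILE_MINUTE_0.log', 'FILE_MINUTE_1.log' );   # placeholders

my %records;                                  # joined entries, keyed by ID
foreach my $file (@files) {
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my ( $date, $id );
    while (<$fh>) {
        chomp;
        if (/^(\d+)\s*$/) {                   # a date line starts a new entry
            $date = $1;
        }
        elsif (/^\s+ID\s*=\s*(\S+)/) {
            $id = $1;
            $records{$id}{DATE} = $date;
        }
        elsif (/^\s+TIMESTAMP\s*=\s*(\S+)/) {
            $records{$id}{TIMESTAMP} = $1;
        }
        elsif (/^\s+(COUNT_\d+)\s*=\s*(\S+)/) {
            $records{$id}{$1} += $2;
        }
    }
    close $fh;
}

# Write the joined file, skipping records that never saw a COUNT_1.
open my $out, '>', 'FINAL_FILE_1.log' or die "Cannot open FINAL_FILE_1.log: $!";
for my $id ( sort { $a <=> $b } keys %records ) {
    my $rec = $records{$id};
    next unless defined $rec->{COUNT_1};
    my $total = $rec->{COUNT_1}
              + ( defined $rec->{COUNT_2} ? $rec->{COUNT_2} : 0 );
    print {$out} "$rec->{DATE}\n",
                 "    ID = $id\n",
                 "    TIMESTAMP = $rec->{TIMESTAMP}\n",
                 "    TOTAL = $total\n";
}
close $out;

This sketch keys only on ID; in the real per-minute pipeline the timestamp would probably need to be part of the key as well.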

Any tips or other considerations would be helpful. Thanks.

Re: Process Log Files - Join Log entry pairs
by Utilitarian (Vicar) on Jun 17, 2009 at 12:52 UTC
    while (<$FILE>) {
        if (/^\d/) {
            # new entry: sum the values with those already in $hash{$id}
            # and increment $hash{$id}->count
        }
        elsif (/^\s+ID/) {
            @fields = split /\s+/;
            $id = $fields[3];
        }
        ...
    This type of structure lets you build up a hash as you read each file; you can then sum the values across the files by iterating over its keys (keys %hash).

    Edit: fixed elsif typo Transient pointed out below.

      A few things with that... while (<$FILE>) {
      assumes you have a variable holding a filehandle (a glob). I'm not sure why you'd want to do that here; just a caveat.

      elif should be elsif

      and your code also assumes both a single line read (in the /^\d/ case) and a multi-line read (in the /^\s+ID/ case), unless I'm reading that incorrectly.

      Here's what I came up with:
      use Data::Dumper;
      use strict;

      my $hash  = {};
      my @files = ( "LOG1", "LOG2", "LOG3" );

      foreach my $file ( @files ) {
          open FILE, "<", $file or die "Unable to open $file\n$!\n";
          my ( $date, $id, $timestamp ) = ( 'Unknown', '', '' );
          while (<FILE>) {
              chomp;
              if (/^\d/) {
                  $date = $_;
              } elsif (/^\s+ID\s*=\s*(.*)$/) {
                  $id = $1;
              } elsif (/^\s+TIMESTAMP\s*=\s*(.*)$/) {
                  $timestamp = $1;
              } elsif (/^\s+COUNT_\d+\s*=\s*(.*)$/) {
                  $hash->{$date}->{$id}->{$timestamp} += $1;
              }
          }
          close FILE;
      }
      print Dumper($hash), "\n";
      which would give you
      $VAR1 = {
                '20090619' => {
                                '4' => {
                                         '1244127600' => '20'
                                       },
                                '1' => {
                                         '1244127600' => '20'
                                       },
                                '3' => {
                                         '1244127600' => '10'
                                       },
                                '0' => {
                                         '1244127600' => '20'
                                       },
                                '2' => {
                                         '1244127600' => '10'
                                       }
                              }
              };
      Assumptions are that the log files are all in the order specified, the actual COUNT_X values are immaterial, etc.
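      One caveat with collapsing everything into a single running total: the OP also wants to drop records that never saw a COUNT_1 (the ID = 3 case), which this structure can no longer tell apart. A sketch of one variation, assuming the COUNT branch above is changed to keep the individual counters ($hash->{$date}{$id}{$timestamp}{$1} += $2) and using a hypothetical write_final() helper:

      sub write_final {
          my ( $hash, $out_file ) = @_;
          open my $out, '>', $out_file or die "Unable to open $out_file\n$!\n";
          for my $date ( sort keys %$hash ) {
              for my $id ( sort { $a <=> $b } keys %{ $hash->{$date} } ) {
                  for my $ts ( sort keys %{ $hash->{$date}{$id} } ) {
                      my $counts = $hash->{$date}{$id}{$ts};
                      next unless exists $counts->{COUNT_1};   # skip the ID = 3 case
                      my $total = 0;
                      $total += $_ for values %$counts;
                      print {$out} "$date\n    ID = $id\n",
                                   "    TIMESTAMP = $ts\n    TOTAL = $total\n";
                  }
              }
          }
          close $out;
      }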
Re: Process Log Files - Join Log entry pairs
by gulden (Monk) on Jun 23, 2009 at 21:47 UTC
    Thanks for the tips.

    But if I want to use threads to process multiple files while updating the same hash, how can I do that and still keep the performance up?

    use threads::shared;
    my %hash : shared;

    The above hash declaration doesn't work for nested hashes. Any tips?
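
    A minimal sketch of one way around that, assuming a reasonably recent threads::shared (shared_clone() is a newer addition; on older versions &share() can be used instead): the :shared attribute only shares the top-level %hash, so storing a plain anonymous hash inside it fails with an "Invalid value for shared scalar" error. The inner structures have to be shared too:

    use strict;
    use warnings;
    use threads;
    use threads::shared;

    my %hash : shared;

    # Inner hashes must themselves be shared before they can be stored
    # inside a shared hash; shared_clone() shares a structure recursively.
    $hash{0} = shared_clone( { ID => 0, COUNT_1 => 10, COUNT_2 => 20 } );

    # Workers (one per log file, say) should serialise updates with lock().
    my @workers = map {
        threads->create( sub {
            my $id = shift;
            lock(%hash);
            $hash{$id} = shared_clone( { ID => $id, COUNT_1 => 0, COUNT_2 => 0 } );
        }, $_ );
    } ( 1 .. 3 );
    $_->join for @workers;

    Note that a single shared hash plus locking can serialise the threads and eat the speed-up; an alternative worth measuring is giving each thread its own private hash and merging the results after join().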