tzen has asked for the wisdom of the Perl Monks concerning the following question:

Hello, All -

First-time Perl user here. I have been tasked with creating a script that parses interlaced log files, assembles transactions from multiple lines, and then inserts them into a DB.

The complications:

1) The logs will be cut and dumped onto a box and my script is supposed to continuously parse them, insert the transactions into an Oracle DB, and move the logs to an archive.

2) Each "transaction" is composed of multiple lines.

3) The logs contain outputs from multiple threads, which are interlaced. Fortunately, the thread ID is included on each line.

4) The logs are expected to be large (250K+ transactions, potentially millions of lines or more).

The lines are formatted as follows:

08/25/2009 11:29:03.991 (30)Parsing request
08/25/2009 11:29:03.991 (30)----------------------
08/25/2009 11:29:03.991 (30)Authentication Request
08/25/2009 11:29:03.991 (30)Received From: ip=XXX port=XXX

The thread ID is a hexadecimal value within the parentheses.

Here is my code so far:

#!/usr/bin/perl -w
use strict;

my @files = <*.log>;

foreach my $file (@files) {
    my %threads = ();

    # Get all the unique thread IDs from lines containing transaction
    # headers in one file and put them in hash %threads.
    open(LOGFILE, $file) or warn("Could not open log file.");
    while (<LOGFILE>) {
        if ( m/.{10}\s.{12}\s\((\w+)\)(?:Authentication Request|Authentication Response|Accounting Request)/ ) {
            $threads{$1} = '1' unless exists $threads{$1};   # '1' just to put something in there
        }
    }
    close(LOGFILE) or warn("Could not close log file: $file.");

    # Loop through the file once for each key in hash %threads.
    foreach my $thread (keys %threads) {
        open(LOGFILE, $file) or warn("Could not open log file.");
        TRANSACTION: while (<LOGFILE>) {
            my %transaction = ();
            # if a line is found with the current thread ID and an
            # Authentication Request header...
            if ( m/(.{10}\s.{12})\s\($thread\)Authentication Request/ ) {
                $transaction{'timestamp'} = $1;
                $transaction{'thread'}    = $thread;
                $transaction{'type'}      = "Request";
                # read each subsequent line to build the transaction
                while (<LOGFILE>) {
                    if ( m/.{10}\s.{12}\s\($thread\)Acct-Session-Id : String Value = (.*)$/ ) {
                        $transaction{'Acct-Session-Id'} = $1;
                    } elsif ( m/.{10}\s.{12}\s\($thread\)User-Name : String Value = (.*)$/ ) {
                        $transaction{'User-Name'} = $1;
                    } elsif ( m/.{10}\s.{12}\s\($thread\)(?:Authentication Request|Authentication Response|Accounting Request)/ ) {
                        # a new transaction header was found: emit the
                        # current transaction and reprocess this line
                        print map { "$_ => $transaction{$_}\n" } keys %transaction;
                        redo TRANSACTION;
                    }
                }
                print map { "$_ => $transaction{$_}\n" } keys %transaction;
            } elsif ( m/(.{10}\s.{12})\s\($thread\)Authentication Response/ ) {
                # ...or an Authentication Response header...
                $transaction{'timestamp'} = $1;
                $transaction{'thread'}    = $thread;
                $transaction{'type'}      = "Response";
                # read each subsequent line to build the transaction
                while (<LOGFILE>) {
                    if ( m/.{10}\s.{12}\s\($thread\)Acct-Session-Id : String Value = (.*)$/ ) {
                        $transaction{'Acct-Session-Id'} = $1;
                    } elsif ( m/.{10}\s.{12}\s\($thread\)User-Name : String Value = (.*)$/ ) {
                        $transaction{'User-Name'} = $1;
                    } elsif ( m/.{10}\s.{12}\s\($thread\)(?:Authentication Request|Authentication Response|Accounting Request)/ ) {
                        # a new transaction header was found: emit the
                        # current transaction and reprocess this line
                        print map { "$_ => $transaction{$_}\n" } keys %transaction;
                        redo TRANSACTION;
                    }
                }
                print map { "$_ => $transaction{$_}\n" } keys %transaction;
            } elsif ( m/(.{10}\s.{12})\s\($thread\)Accounting Request/ ) {
                # ...or an Accounting Request header...
                $transaction{'timestamp'} = $1;
                $transaction{'thread'}    = $thread;
                $transaction{'type'}      = "Accounting";
                # read each subsequent line to build the transaction
                while (<LOGFILE>) {
                    if ( m/.{10}\s.{12}\s\($thread\)Acct-Session-Id : String Value = (.*)$/ ) {
                        $transaction{'Acct-Session-Id'} = $1;
                    } elsif ( m/.{10}\s.{12}\s\($thread\)User-Name : String Value = (.*)$/ ) {
                        $transaction{'User-Name'} = $1;
                    } elsif ( m/.{10}\s.{12}\s\($thread\)(?:Authentication Request|Authentication Response|Accounting Request)/ ) {
                        # a new transaction header was found: emit the
                        # current transaction and reprocess this line
                        print map { "$_ => $transaction{$_}\n" } keys %transaction;
                        redo TRANSACTION;
                    }
                }
                print map { "$_ => $transaction{$_}\n" } keys %transaction;
            }
        }
        close(LOGFILE) or warn("Could not close log file: $file.");
    }
}

The code opens a file, scans the lines containing transaction headers for thread IDs, and stores those IDs as hash keys. Then it loops through the file once per thread ID and assembles that thread's transactions.

There is a separate loop for each transaction type because the transactions actually contain different data; this code is incomplete for now.

I'm concerned that reading each log file one time per thread ID is going to be too slow. I'm also not sure if I should insert individual transactions as they are assembled or store them all in an array and insert them all at once. Currently the code is just printing each transaction.

Thank you in advance for any assistance.

Replies are listed 'Best First'.
Re: Interlaced log parser
by BrowserUk (Patriarch) on Sep 03, 2009 at 21:03 UTC

    I think that I would separate the concerns by prefiltering the files and feeding each thread's lines to different processes via pipes:

#! perl
use strict;
use warnings;

#### Warning! Untested code -- just an example.

for my $log ( <*.log> ) {
    open my $fhLog, '<', $log or die $!;
    my %kids;
    while( <$fhLog> ) {
        my( $tid ) = m[\(([0-9A-F]+)\)];
        if( not exists $kids{ $tid } ) {
            my $pid = open my $fh, '|-', 'perl non-interleaved.pl'
                or die $!;
            $kids{ $tid } = [ $pid, $fh ];
        }
        print { $kids{ $tid }[ 1 ] } $_;
    }
    for my $kid ( values %kids ) {
        close $kid->[ 1 ];          ## close the pipe
        waitpid $kid->[ 0 ], 0;     ## wait for the kid to finish
    }
    rename $log, "archive/$log";    ## move the file
}

    This way, you don't have to accumulate lots of lines in memory in order to get complete transactions. The same simplified non-interleaved.pl handles each thread's data, and deals with just one transaction at a time.
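    And non-interleaved.pl reduces to something simple like this (again untested, just a sketch -- insert_transaction() is a stub where your DBI code would go, and the field regexes are guessed from your sample):

#! perl
use strict;
use warnings;

## Untested sketch: STDIN carries one thread's lines, so transactions
## arrive strictly one at a time.
my %xact;

while( <STDIN> ) {
    if( m[^(.{10}\s.{12})\s\(\w+\)(Authentication Request|Authentication Response|Accounting Request)] ) {
        insert_transaction( \%xact ) if %xact;   ## previous one is complete
        %xact = ( timestamp => $1, type => $2 );
    }
    elsif( m[\(\w+\)([\w-]+) : String Value = (.*)$] ) {
        $xact{ $1 } = $2;
    }
}
insert_transaction( \%xact ) if %xact;           ## flush the last one

sub insert_transaction {
    my $xact = shift;
    ## stub: the real thing would do the DB insert here
    print "$_ => $xact->{ $_ }\n" for keys %$xact;
    print "---\n";
}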


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Interlaced log parser
by SuicideJunkie (Vicar) on Sep 03, 2009 at 19:23 UTC

    Individual transactions should probably not be stored in an array, since there is no easy way to find the stored transaction data when you next see a log line about it and need the info.
    If you have unique identifiers for each transaction (which you must), and they are not simple small sequential integers (which I'm sure they aren't), then a hash is the way to go.

    For example, you could scan through the log, and every time you see an event for transaction $Tname, you reference $transactions{$Tname} to find your notes on what you have seen about it so far.

    A hash of hashes of (scalars with a few lists mixed in) would be what I expect. Something like:

my %transactions = (
    transactionOne => {
        startTime  => 123,
        userName   => 'Bob',
        endTime    => undef,
        foosMunged => [1, 5, 99],
    },
    transactionTwo => { ... },
);
    Note that $transactions{transactionOne}{endTime} is undef, since you haven't seen the end yet and do not know what the value should be.

    Fill in the transaction data as you see it, and once you see the end of a transaction, you can do whatever you need to do with the data you've collected, then delete the hash entry or leave it for later (Given huge files, deleting the old info is probably best). By the time one transaction ends, a few others will be partially populated, but stored safely out of the way behind their own ID keys.

Re: Interlaced log parser
by ig (Vicar) on Sep 03, 2009 at 21:06 UTC

    I would try to process the log files in a single pass. You don't give a sufficient sample of your log files to be certain, but perhaps something like the following would work.

use strict;
use warnings;

my %transaction;

foreach my $file (<*.log>) {
    open(my $fh, '<', $file) or die "$file: $!";
    while(<$fh>) {
        chomp;
        if(/(.*?) \(([^\)]+)\)(.*)/) {
            my ($timestamp, $id, $log) = ($1, $2, $3);
            push(@{$transaction{$id}}, [ $timestamp, $id, $log, $file, $. ]);
            my $status = status($transaction{$id});
            if($status eq 'complete') {
                submit($transaction{$id});
                delete($transaction{$id});
            } elsif($status eq 'corrupt') {
                report($status, $transaction{$id});
                delete($transaction{$id});
            }
        } else {
            warn "invalid input line: $_";
        }
    }
    close($fh);
}

foreach my $id (keys %transaction) {
    report('incomplete', $transaction{$id});
}

exit(0);

sub status {
    my $transaction = shift;

    return('incomplete') if(@$transaction < 3);
    return('corrupt') if(
        @$transaction > 3
        or $transaction->[0][2] !~ /^(Authentication Request|Authentication Response|Accounting Request)/
        or $transaction->[1][2] !~ /^Acct-Session-Id :/
        or $transaction->[2][2] !~ /^User-Name :/
    );
    return('complete');
}

sub submit {
    my $transaction = shift;

    print "submit to database\n";
    foreach my $entry (@$transaction) {
        my ($timestamp, $id, $log, $file, $lineno) = @$entry;
        print "\t$file line $lineno: $timestamp ($id)$log\n";
    }
}

sub report {
    my ($status, $transaction) = @_;

    print STDERR "ERROR: $status transaction\n";
    foreach my $entry (@$transaction) {
        my ($timestamp, $id, $log, $file, $lineno) = @$entry;
        print STDERR "\t$file line $lineno: $timestamp ($id)$log\n";
    }
}
Re: Interlaced log parser
by dwm042 (Priest) on Sep 03, 2009 at 21:11 UTC
    tzen,

    With a log that is potentially millions of lines long, you want to read the log file once only, or else you suffer an enormous performance penalty. SuicideJunkie has pointed out one data structure to do that. If the transaction IDs are unique per transaction, a hash of transaction hashes would be a reasonable way to go. Fill in the data as it's seen.

Re: Interlaced log parser
by tzen (Initiate) on Sep 03, 2009 at 21:42 UTC
    Thanks for the replies, guys. Unfortunately, the timestamp of a line can change during a single transaction while the log is being generated, so I don't think I can rely on that + thread ID as a unique identifier. Another problem is that a transaction may start at the end of one log file and carry over onto the next log file.

      If you can identify the end of a transaction, and each thread only works on one transaction at a time, you won't have a problem.

      Use the thread ID as the hash key, and delete that entry once the transaction is complete and you have done whatever you need to do with the data. That way, you have a nice clean place to start building up info about the thread's next transaction.

      The key things:

      • Whatever scheme you use to identify a transaction when doing multiple passes of the file, use that as your top-level hash key.
      • Whatever local variables you previously used to store info about the transaction, use those names as your second-level hash keys.
      • Just keep reading log files in order and don't worry; the inner code should not care which file the lines are coming from, just that it has a stream of lines to analyse.

foreach my $file (@files) {
    open my $FH, '<', $file or die "$file open says $!";
    while (<$FH>) {
        ponder($_);   # parse the line: sets $threadID, $stuff, $transactionEnded
        $transactionHash{$threadID}{oldVariableName} = $stuff;
        delete $transactionHash{$threadID} if $transactionEnded;
    }
}
Re: Interlaced log parser
by DStaal (Chaplain) on Sep 04, 2009 at 14:12 UTC

    As someone who recently had to solve this problem at work (well, a slightly different one: I didn't have to do it in real time): reading the file over and over is going to be slow. Probably slower than you want to deal with.

    To actually get decent speed I finally moved the assembly of transactions into the database: I insert each line as I read it, letting the database record fill in as it gets information. (Note that you need some way to know which record each line is going to be in, and identify it uniquely over the course of at least a log file. Depending on the format of the logs, it may be worth keeping track of just the 'active' transactions.)
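    For illustration (a sketch only -- the DSN, table, and column names are invented, and this assumes an Oracle target like yours rather than my temp database), a MERGE lets each line's fields fill in the row as they arrive:

use strict;
use warnings;
use DBI;

## Sketch only: the DSN, credentials, and transactions table are made up.
my $dbh = DBI->connect( 'dbi:Oracle:mydb', 'user', 'pass',
                        { RaiseError => 1, AutoCommit => 0 } );

my $merge = $dbh->prepare( q{
    MERGE INTO transactions t
    USING (SELECT ? AS id FROM dual) s
    ON (t.id = s.id)
    WHEN MATCHED THEN UPDATE SET
        t.user_name  = NVL(?, t.user_name),
        t.session_id = NVL(?, t.session_id)
    WHEN NOT MATCHED THEN
        INSERT (id, user_name, session_id) VALUES (s.id, ?, ?)
} );

## For each parsed line, bind whatever fields that line carried;
## an undef leaves the existing column value alone.
my ( $id, $user, $session ) = ( '30:0001', 'Bob', undef );
$merge->execute( $id, $user, $session, $user, $session );
$dbh->commit;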

    Doing this allowed me to process 1-1.5GB of data a day in a couple of hours. (Despite having to use a temp database, since the DBD install for the main database is borked beyond repair.)

Re: Interlaced log parser
by roboticus (Chancellor) on Sep 04, 2009 at 15:54 UTC
    tzen:

    I've done similar things, and I find it best to make a single pass through the file. To do so, here's the approach I take:

    First, I use a hash to contain all current transactions. (I'm assuming that each thread is handling only one task at a time, so within a thread you're not getting multiple transactions intermingled.) In this case, I'd use the thread ID as the key.

    Next, read each line. You're going to find that the line is one of:

    1. The start of a new transaction. Just emit the current transaction for the thread (if any) and start collecting the data for the new transaction.
    2. The end of a transaction (some transactions will have recognizable ends). Emit the transaction and delete the data from the hash.
    3. Additional data for the transaction. Add the appropriate information to the hash.
    4. An explicitly ignored line (comments, blank lines, information you're not collecting, etc.)
    5. An unrecognized line, in which case you would print a warning or take similar action if you care.
    6. An unexpected line, i.e., you recognize it but it's unexpected at this time. (Such as a transaction end before you get a transaction start for the thread.)

    A bit of code to illustrate:
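    (Untested sketch of the single-pass loop: one pending transaction per thread ID, dispatching on the line type; complete_xxx() is a stub where the real per-type handling and DB insert would go.)

#!/usr/bin/perl
use strict;
use warnings;

my %xact;   ## current transaction per thread ID

while (my $line = <>) {
    chomp $line;
    next if $line =~ /^\s*$/;                      ## case 4: explicitly ignored
    if ($line !~ /^(.{10}\s.{12})\s\((\w+)\)(.*)$/) {
        warn "unrecognized line: $line\n";         ## case 5
        next;
    }
    my ($ts, $tid, $msg) = ($1, $2, $3);
    if ($msg =~ /^(Authentication Request|Authentication Response|Accounting Request)/) {
        ## case 1: new transaction -- emit any pending one first
        complete_xxx($xact{$tid}) if $xact{$tid};
        $xact{$tid} = { timestamp => $ts, thread => $tid, type => $1 };
    }
    elsif ($msg =~ /^([\w-]+) : String Value = (.*)$/) {
        if (my $hr = $xact{$tid}) { $hr->{$1} = $2 }            ## case 3
        else { warn "data with no transaction open: $line\n" }  ## case 6
    }
}
complete_xxx($_) for values %xact;   ## flush whatever is still open

sub complete_xxx {
    my $hr = shift;
    ## stub -- the real version would dispatch on $$hr{type} and do the insert
    print "$_ => $$hr{$_}\n" for sort keys %$hr;
    print "\n";
}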

    ...roboticus

    Update: And if I had read the entire thread, I would've noticed that ig had already given an example of how to do this. Ah, well, it happens when you don't get enough sleep. I also added the <readmore> tags, as the post was a bit longish.

      Wow, this is so helpful! Many thanks! Is it actually passing an entire transaction hash to the subroutine? Would I do the DB insert in the complete_xxx($hr) subroutines? Should I store all the complete transactions and then insert them all, or should I constantly be having it insert transactions? Again, thank you for your help.
        tzen:
        Is it actually passing an entire transaction hash to the subroutine?

        Kind of. It's passing a reference to an entire transaction hash to the subroutine. When we refer to the $$hr{type} value, it first looks up the address of the hash with the innermost $hr part. Then it looks up the type member of the hash that $hr references.
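        To see the two notations side by side:

my %transaction = ( type => 'Request', thread => '30' );
my $hr = \%transaction;   # $hr holds a reference to %transaction

print $$hr{type};         # dereference with the extra sigil...
print $hr->{type};        # ...or, equivalently, with the arrow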

        Would I do the DB insert in the complete_xxx($hr) subroutines?

        I did in my project, but you certainly don't have to. You could build a reformatted file that you can load into your database with a bulk-loading tool, like BCP (in MS SQL).

        Should I store all the complete transactions and then insert them all, or should I constantly be having it insert transactions?

        You can do it either way. I tend to insert them as I go because some of the files I work with are large, and I don't have enough RAM to hold it all in memory. But if all your data will fit, you can do it that way if it's easier.
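        If you do insert as you go, one thing worth doing (sketch below -- the DSN and table layout are invented) is turning off AutoCommit and committing in batches, so you don't pay for a commit on every row:

use strict;
use warnings;
use DBI;

## Sketch: DSN, credentials, and table layout are placeholders.
my $dbh = DBI->connect( 'dbi:Oracle:mydb', 'user', 'pass',
                        { RaiseError => 1, AutoCommit => 0 } );
my $ins = $dbh->prepare(
    'INSERT INTO transactions (ts, thread, type, session_id, user_name)
     VALUES (?, ?, ?, ?, ?)' );

my @completed;   ## filled in by your parser as transactions finish
my $count = 0;
for my $t (@completed) {
    $ins->execute( @{$t}{qw(timestamp thread type Acct-Session-Id User-Name)} );
    $dbh->commit unless ++$count % 5000;   ## commit every 5000 rows
}
$dbh->commit;                              ## catch the remainder
$dbh->disconnect;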

        ...roboticus