in reply to Interlaced log parser

tzen:

I've done similar things, and I find it best to make a single pass through the file. To do so, here's the approach I take:

First, I use a hash to contain all current transactions. (I'm assuming that each thread is handling only one task at a time, so within a thread you're not getting multiple transactions intermingled.) In this case, I'd use the thread ID as the key.

Next, read each line. You're going to find that the line is one of:

  1. The start of a new transaction. Just emit the current transaction for the thread (if any) and start collecting the data for the new transaction.
  2. The end of a transaction (some transactions will have recognizable ends). Emit the transaction and delete the data from the hash.
  3. Additional data for the transaction. Add the appropriate information to the hash.
  4. An explicitly ignored line (comments, blank lines, information you're not collecting, etc.)
  5. An unrecognized line, in which case you'd print a warning (or take some similar action) if you care.
  6. An unexpected line, i.e., you recognize it but it's unexpected at this time. (Such as a transaction end before you get a transaction start for the thread.)

A bit of code to illustrate:

    use strict;
    use warnings;

    my %TxnQ = ();

    while (<DATA>) {

        ##### TRANSACTION HEADERS #####

        # HEADER: emit previous transaction (if any), start new one
        if ( m/(.{10}\s.{12})\s\((\d+)\)Authentication Request/ ) {
            my ($timestamp, $thread) = ($1, $2);
            # Emit previous transaction, if any
            complete_transaction($TxnQ{$thread}) if exists $TxnQ{$thread};
            # Delete previous data by replacing with new data
            # (an anonymous hash, so it can be handed around by reference)
            $TxnQ{$thread} = { timestamp => $timestamp, type => 'Request' };
        }
        # ...etc...

        ##### INTERMEDIATE LINES #####
        elsif ( m/.{10}\s.{12}\s\((\d+)\)Acct-Session-Id : String Value = (.*)$/ ) {
            # Just add the additional data to the thread's transaction record
            $TxnQ{$1}{'Acct-Session-Id'} = $2;
        }
        # ...etc...

        ##### TRANSACTION TERMINATORS #####
        elsif ( m/.{10}\s.{12}\s\((\d+)\)User-Name : String Value = (.*)$/ ) {
            my ($thread, $user) = ($1, $2);
            # Add the final data item(s) (if req'd)
            $TxnQ{$thread}{'User-Name'} = $user;
            # Process the transaction
            complete_transaction($TxnQ{$thread});
            # Delete the data
            delete $TxnQ{$thread};
        }
        # ...etc...

        ##### LINES WE DON'T CARE ABOUT #####
        elsif ( m/frammistat/ || m/^\s*$/ || m/^#/ ) {
            # DO NOTHING  We're explicitly ignoring these lines
        }
        else {
            print "LINE $.: Unrecognized line.  Complete text:\n$_";
        }
    }

    # Complete remaining transactions (hopefully complete transactions
    # that don't have explicit transaction terminator lines)
    for (keys %TxnQ) {
        complete_transaction($TxnQ{$_});
    }

    sub complete_transaction {
        my $hr = shift;    # reference to one transaction's hash
        if (!defined $$hr{type}) {
            print "Incomplete transaction found!\n";
        }
        elsif ($$hr{type} eq 'Request')  { complete_request($hr) }
        elsif ($$hr{type} eq 'Response') { complete_response($hr) }
        # ...etc...
        else {
            print "ERROR: Unexpected transaction type: $$hr{type}!\n";
        }
    }

Obviously, you'd need to add error handling and such as you see fit. Standard disclaimers apply: Untested code, use at your own risk, if it breaks you can keep all the pieces, etc. ad nauseam.
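
One piece of that error handling is case 6 from the list: a line you recognize, but which arrives when the thread has no open transaction. Here is a minimal sketch of one way to catch it, assuming the same line layout as the regexes above. The add_field helper, the @warnings list, and the sample log lines are all made up for illustration.

```perl
use strict;
use warnings;

my %TxnQ;        # open transactions, keyed by thread ID
my @warnings;    # collected here so the caller decides how to report them

# Hypothetical helper: store a field only if the thread has an open transaction.
sub add_field {
    my ($thread, $name, $value, $line_no) = @_;
    unless (exists $TxnQ{$thread}) {
        push @warnings,
            "LINE $line_no: '$name' for thread $thread with no open transaction";
        return 0;
    }
    $TxnQ{$thread}{$name} = $value;
    return 1;
}

# Made-up sample lines in the shape the regexes expect.
my @lines = (
    '2009-09-04 21:20:00.123 (17)Authentication Request',
    '2009-09-04 21:20:00.124 (17)Acct-Session-Id : String Value = abc123',
    '2009-09-04 21:20:00.125 (42)Acct-Session-Id : String Value = zzz999',
);

my $line_no = 0;
for my $line (@lines) {
    $line_no++;
    if ($line =~ m/^(.{10}\s.{12})\s\((\d+)\)Authentication Request/) {
        $TxnQ{$2} = { timestamp => $1, type => 'Request' };
    }
    elsif ($line =~ m/^.{10}\s.{12}\s\((\d+)\)Acct-Session-Id : String Value = (.*)$/) {
        # Thread 42 never sent a header, so this call records a warning for line 3
        add_field($1, 'Acct-Session-Id', $2, $line_no);
    }
}

print "$_\n" for @warnings;
```

Without a check like this, the plain assignment would quietly autovivify a new hash entry for thread 42, and you'd only hear about it later as an "Incomplete transaction".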

...roboticus

Update: And if I had read the entire thread, I would've noticed that ig had already given an example of how to do this. Ah, well, it happens when you don't get enough sleep. I also added the <readmore> tags, as the post was a bit longish.

Re^2: Interlaced log parser
by tzen (Initiate) on Sep 04, 2009 at 21:20 UTC
    Wow, this is so helpful! Many thanks! Is it actually passing an entire transaction hash to the subroutine? Would I do the DB insert in the complete_xxx($hr) subroutines? Should I store all the complete transactions and then insert them all, or should I constantly be having it insert transactions? Again, thank you for your help.
      tzen:
      Is it actually passing an entire transaction hash to the subroutine?

      Kind of. It's passing a reference to an entire transaction hash to the subroutine. When we refer to the $$hr{type} value, it first looks up the address of the hash with the innermost $hr part. Then it looks up the type member of the hash that $hr references.
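
      A small sketch of what that means in practice. The describe sub and the sample data are made up for illustration; the point is that $$hr{type} and $hr->{type} are two spellings of the same lookup, and that the sub works on the caller's hash rather than a copy.

```perl
use strict;
use warnings;

my %TxnQ = (
    17 => { timestamp => '2009-09-04 21:20:00.123', type => 'Request' },
);

# Hypothetical sub, for illustration only
sub describe {
    my $hr = shift;      # $hr holds a reference, not a copy of the hash
    $hr->{seen} = 1;     # so this change is visible to the caller
    # $$hr{type} and $hr->{type} are equivalent ways to dereference
    return "$$hr{type} at $hr->{timestamp}";
}

my $summary = describe($TxnQ{17});
print "$summary\n";
print "seen flag in the original hash: $TxnQ{17}{seen}\n";
```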

      Would I do the DB insert in the complete_xxx($hr) subroutines?

      I did in my project, but you certainly don't have to. You could build a reformatted file that you can load into your database with a bulk-loading tool, like BCP (in MS SQL).
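
      If you go the bulk-load route, the complete_xxx subs only need to write one delimited line per transaction. A rough sketch, assuming a tab-delimited file and a column order I made up; the extra filehandle argument to complete_request is also my own addition, so adjust it to fit your code and your loader.

```perl
use strict;
use warnings;

# Hypothetical column order; match it to your table definition.
my @columns = ('timestamp', 'type', 'User-Name', 'Acct-Session-Id');

# Turn one transaction hash into a tab-delimited line.
sub format_row {
    my $hr = shift;
    return join("\t", map {
        my $v = defined $hr->{$_} ? $hr->{$_} : '';   # missing field -> empty column
        $v =~ s/[\t\r\n]/ /g;                         # keep delimiters out of the data
        $v;
    } @columns) . "\n";
}

# Hypothetical variant of complete_request: append to an open filehandle.
sub complete_request {
    my ($hr, $fh) = @_;
    print {$fh} format_row($hr);
}

# Demo: write to an in-memory filehandle; open a real file in practice.
open my $out, '>', \my $buffer or die "Cannot open in-memory file: $!";
complete_request(
    { timestamp => '2009-09-04 21:20:00.123', type => 'Request', 'User-Name' => 'tzen' },
    $out,
);
close $out;
print $buffer;
```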

      Should I store all the complete transactions and then insert them all, or should I constantly be having it insert transactions?

      You can do it either way. I tend to insert them as I go because some of the files I work with are large, and I don't have enough RAM to hold it all in memory. But if all your data will fit, you can do it that way if it's easier.
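
      There's also a middle ground if neither extreme suits you: hold a fixed number of completed transactions and write them out in batches. This is only a sketch. flush_pending is a stand-in for whatever actually talks to your database, and the batch size is arbitrary.

```perl
use strict;
use warnings;

my $BATCH_SIZE = 2;    # tiny for the demo; pick something that suits your data
my @pending;           # completed transactions not yet written
my @flushed_sizes;     # demo bookkeeping only

# Hypothetical stand-in for the real work (e.g. several inserts, then a commit).
sub flush_pending {
    return unless @pending;
    push @flushed_sizes, scalar @pending;
    @pending = ();
}

# Call this from the complete_xxx subs instead of inserting directly.
sub queue_transaction {
    my $hr = shift;
    push @pending, $hr;
    flush_pending() if @pending >= $BATCH_SIZE;
}

queue_transaction({ type => 'Request', id => $_ }) for 1 .. 5;
flush_pending();       # don't forget the partial batch at end of file

print "batch sizes: @flushed_sizes\n";
```

      Memory use stays bounded by the batch size, and the final flush_pending call after the read loop picks up whatever is left over.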

      ...roboticus