Structuring maillog data - hopefully simple question on arrays/hashes

billie_t has asked for the wisdom of the Perl Monks concerning the following question:

Since I am not graced with a scripting brain, and I'm not that great with Perl, I'm having a number of issues trying to report on (Postfix) mail that has originated from certain hosts in our domain destined for other specific domains. I have the regexps working (they're probably very ugly, but who cares), but I have problems associating the two message ids that each message has due to our antispam solution, while performing separate tests on them.

The line in the maillog that mentions both message ids is like this:

Jun  7 13:47:56 smtpserver postfix/smtp[16346]: A9507208022: to=<a.user@somwhere.gov>, relay=localhost[127.0.0.1], delay=0, status=sent (250 Ok: queued as B68C8208095)

The 11-12 character hex-looking strings are the message IDs. In an earlier line of the maillog, the first message ID tells me which client/host the message originated from:

Jun  7 13:47:56 smtpserver postfix/smtpd[12725]: A9507208022: client=ourhost1.our.domain[172.111.111.111]

The second message ID eventually shows up again in a maillog line which shows the message being accepted by the destination server:

Jun  7 13:47:57 smtpserver postfix/smtp[12379]: B68C8208095: to=<a.user@Somewhere.gov>, relay=server.somwhere.gov[10.0.100.100], delay=1, status=sent (250 ok:  Message 2156237 accepted)
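The relationship between the three sample lines can be sketched in Perl as follows. This is only an illustration built on the lines quoted above: the sub name `queue_ids` is hypothetical, and the patterns assume that a message ID is 11 to 12 uppercase hex characters, appearing either right after the `postfix/...[pid]:` prefix or inside `queued as ...)`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the message ID that starts the log entry and,
# if the line has one, the ID reported after "queued as".
sub queue_ids {
    my ($line) = @_;
    my ($first)  = $line =~ /\]:\s+([0-9A-F]{11,12}):/;
    my ($second) = $line =~ /queued as ([0-9A-F]{11,12})\)/;
    return ($first, $second);
}

# The three sample lines from the post, unwrapped.
my @sample = (
    'Jun  7 13:47:56 smtpserver postfix/smtpd[12725]: A9507208022: client=ourhost1.our.domain[172.111.111.111]',
    'Jun  7 13:47:56 smtpserver postfix/smtp[16346]: A9507208022: to=<a.user@somwhere.gov>, relay=localhost[127.0.0.1], delay=0, status=sent (250 Ok: queued as B68C8208095)',
    'Jun  7 13:47:57 smtpserver postfix/smtp[12379]: B68C8208095: to=<a.user@Somewhere.gov>, relay=server.somwhere.gov[10.0.100.100], delay=1, status=sent (250 ok:  Message 2156237 accepted)',
);

for my $line (@sample) {
    my ($first, $second) = queue_ids($line);
    # A missing ID is undef, so substitute a placeholder for printing.
    printf "%s -> %s\n", $first // '(none)', $second // '(none)';
}
```

Only the middle line yields both IDs, which is what makes it the link between the client line and the final delivery line.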

If a message ID has not originated from the correct host, I want to discard both associated IDs. If it has originated from one of the correct hosts, I want to print the line that shows the delivery. I've gotten as far as figuring out I need to run through the maillog a few times to gather the message IDs, discard the ones from the incorrect hosts, and finally print the lines I want. I assume slurping each 75MB file might be a bit much (although I'd be happy to try it).

In short, how do I do the magic in the middle? Is the best thing to use a hash to hold both message IDs? Do I use the first message ID as the "key", and delete it if it's from the wrong client? ie:

foreach $line ( <FH> ) {                
   if ( $line !~ /to=<.*(our\.domain)/i) {    
# our external domain is a subset of the ones we're looking for
        if ($line =~ /to=<.*\.gov>.*relay=localhost/i) {
            @ids = ($line =~ /\ ([0-9A-F]{11,12})/g);
            %msgids{$ids[0]} = $ids[1];
        }
    }
}  

open (FH2, $file ) or  die "Cannot open file 2nd time: $!";
foreach $line ( <FH2> ) {
    if  ( $line !~ /client=(ourhost1|ourhost2)/ ) {
        foreach $keys (%msgids) {
            if ($key =~ /$line/) {
                delete $msgids $key;
            }
      }
}
# then do something to run through the log again and print
# the delivery report line from the second msgid

Is it better to use a while loop to run through the IDs and do the rest of the matching, or a foreach, or what? Or is it a TIMTOWTDI situation?

TIA for sanity-checking and input.

Re: Structuring maillog data - hopefully simple question on arrays/hashes by McDarren (Abbot) on Jun 11, 2007 at 11:47 UTC
Just a couple of general comments, which may or may not be useful ;)

Firstly, I don't really see the need to run through your log file more than once. Also, I think just a single hash (a HoH, actually) would suffice for your purposes. I would use the originating message ID as the key to the hash, and then have several sub-keys such as "destination_id", "status", "source_host", "destination_host", etc. And then just populate the values for each of these as you run through the log file. Once you are done, it's just a matter of iterating through your hash keys and outputting those of interest.

The second point is yes, it would be better to use a while loop to read through your log file. By using a foreach loop, you are effectively slurping the whole file into memory, whereas a while loop will just read line by line. So something like:

while (my $line = <FH>) {
    chomp($line); # you probably want to do this
    # do whatever with contents of $line
}

Hope this helps,
Darren :)
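As a rough picture of the hash of hashes described above, here is a minimal sketch. The sub-key names come from the reply; the helper `remember` and the sample values are assumptions made up for illustration.

```perl
use strict;
use warnings;

my %messages;   # hash of hashes, keyed on the originating message ID

# Hypothetical helper: record one attribute of one message.
sub remember {
    my ($id, $field, $value) = @_;
    # The inner hash is created automatically the first time an ID is used.
    $messages{$id}{$field} = $value;
}

remember('A9507208022', 'source_host',    'ourhost1.our.domain');
remember('A9507208022', 'destination_id', 'B68C8208095');
remember('A9507208022', 'status',         'sent');

# Walk the outer hash, then each inner hash, and print what was collected.
for my $id (sort keys %messages) {
    my $msg = $messages{$id};
    print "$id: ", join(', ', map { "$_=$msg->{$_}" } sort keys %$msg), "\n";
}
```

Because each attribute is filed under the originating ID as soon as it is seen, the order in which the relevant log lines turn up does not matter.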
Re^2: Structuring maillog data - hopefully simple question on arrays/hashes by billie_t (Sexton) on Jun 12, 2007 at 06:52 UTC
An HoH is one of those magical incantations I have not got the hang of yet. But after your suggestion that I could go through the log just once and use while loops in preference (I didn't know that a foreach grabs everything into memory), I managed to nut out a way of grabbing the right info at the right time (I've stuck my mucky code at the bottom of the thread). Thanks very much for your input; it was a great help.
Re: Structuring maillog data - hopefully simple question on arrays/hashes by cdarke (Prior) on Jun 11, 2007 at 14:33 UTC
If I may add, be careful of this construct:

foreach $keys (%msgids) {
    if ($key =~ /$line/) {
        delete $msgids $key;
    }
}

You probably meant the following (note that the match also has to test the line against the key, not the key against the line):

foreach $key (keys %msgids) {
    if ($line =~ /$key/) {
        delete $msgids{$key}
    }
}

Bear in mind that keys %msgids builds a temporary list of all the keys in memory, which can be large (and your original seems to be missing a closing brace). A better way to iterate through a large hash is to use each:

while (($key, $value) = each(%msgids)) {
    if ($line =~ /$key/) {
        delete $msgids{$key}
    }
}
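A self-contained sketch of the each-based loop, for anyone who wants to try it in isolation. The hash contents and the log line are made-up sample data; the one fact relied on is that Perl documents deleting the entry most recently returned by each as safe.

```perl
use strict;
use warnings;

# Made-up sample data: first message ID => second message ID.
my %msgids = (
    A9507208022 => 'B68C8208095',
    C1111111111 => 'D2222222222',
);

# A line showing that C1111111111 came from a host we are not interested in.
my $line = 'Jun  7 13:47:56 smtpserver postfix/smtpd[12725]: C1111111111: client=elsewhere.example[192.0.2.1]';

# Calling keys in void context resets the hash iterator, in case an earlier
# loop over %msgids was left part-way through.
keys %msgids;

while (my ($key, $value) = each %msgids) {
    # Deleting the entry that each() has just returned is safe.
    delete $msgids{$key} if $line =~ /\Q$key\E/;
}

print join(', ', sort keys %msgids), "\n";
```

Adding new keys inside such a loop is a different matter and is best avoided, since it can disturb the iteration order.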
Re^2: Structuring maillog data - hopefully simple question on arrays/hashes by billie_t (Sexton) on Jun 12, 2007 at 06:57 UTC
Ack, yes, my grasp of hashes is shaky at best, and it helps if I use the correct syntax! That while loop syntax is great - it looks much better, and I made heavy use of it (in my ugly script below). Thanks for the help!
Re: Structuring maillog data - hopefully simple question on arrays/hashes by billie_t (Sexton) on Jun 12, 2007 at 06:47 UTC
Thanks to the excellent input I received, I made something that appears to work - it may be ugly, but it's saved me a big job. The interesting part is that the client host (originating server) is only mentioned with reference to the first message ID, and the external delivery status only goes with the second message ID. The line that has both IDs comes halfway through the logging sequence (and other unrelated message transactions are interleaved in between).

#!/usr/bin/perl
use warnings;
use strict;

my @logfiles = qw(maillog.4 maillog.5 maillog.6 [..]);
my $line;
my $file;
my $key;
my $value;

open (CLIENTOUT, ">>client.txt") or die "Cannot open output file: $!";

for $file ( @logfiles ) {
    open (FH, $file ) or die "Cannot open file: $!";
    print "$file\n";
    my %msgids;

    while (my $line = <FH>) {
        chomp($line);
        if ( $line =~ /client=(ourhost1|ourhost2)/ ) {
            no warnings 'uninitialized'; #I don't care about warnings about something uninitialised in line 23
            my @id = ($line =~ /\ ([0-9A-F]{11,12})/);
            $msgids{ $id[0] } = ();
        }
        if ( $line =~ /to=<.*\.gov>.*relay=localhost/ ) {
            if ( $line !~ /to=<.*(ourdomain\.gov)/i) {
                while ( ($key, $value) = each(%msgids) ) {
                    if ($line =~ /$key/) {
                        my @id = ($line =~ /\ ([0-9A-F]{11,12})/g);
                        $msgids{$key} = $id[1];
                    }
                }
            }
        }
        if ( $line =~ /to=<.*\.gov.*status=sent/ ) {
            if ( $line !~ /to=<.*(ourdomain\.gov|relay=localhost)/i) {
                while ( ($key, $value) = each(%msgids) ) {
                    if ( defined $value ) {
                        if ($line =~ /$value/) {
                            print CLIENTOUT "$line\n";
                        }
                    }
                }
            }
        }
    }
    close FH;
}
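One possible refinement of the script above, offered as a sketch rather than a tested replacement: instead of walking the whole hash with each for every log line, pull the message ID out of the line and look it up directly. The names `%wanted`, `%delivery`, `@report` and `process_line` are hypothetical, the recipient-domain filtering of the original is left out for brevity, and the sample lines are the ones from the question.

```perl
use strict;
use warnings;

my %wanted;     # first message IDs that came from the hosts of interest
my %delivery;   # second message ID => first message ID
my @report;     # delivery lines worth printing

sub process_line {
    my ($line) = @_;

    # The message ID is the token right after "postfix/...[pid]: ".
    my ($id) = $line =~ /\]:\s+([0-9A-F]{11,12}):/ or return;

    if ($line =~ /client=(?:ourhost1|ourhost2)/) {
        $wanted{$id} = 1;
    }
    elsif ($line =~ /relay=localhost.*queued as ([0-9A-F]{11,12})\)/) {
        # Link the second ID to the first only for messages we care about.
        $delivery{$1} = $id if $wanted{$id};
    }
    elsif ($line =~ /status=sent/ && exists $delivery{$id}) {
        push @report, $line;
    }
    return;
}

process_line($_) for (
    'Jun  7 13:47:56 smtpserver postfix/smtpd[12725]: A9507208022: client=ourhost1.our.domain[172.111.111.111]',
    'Jun  7 13:47:56 smtpserver postfix/smtp[16346]: A9507208022: to=<a.user@somwhere.gov>, relay=localhost[127.0.0.1], delay=0, status=sent (250 Ok: queued as B68C8208095)',
    'Jun  7 13:47:57 smtpserver postfix/smtp[12379]: B68C8208095: to=<a.user@Somewhere.gov>, relay=server.somwhere.gov[10.0.100.100], delay=1, status=sent (250 ok:  Message 2156237 accepted)',
);

print "$_\n" for @report;
```

Each log line then costs a couple of hash lookups regardless of how many messages are being tracked, which may matter on 75MB files.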
Re^2: Structuring maillog data - hopefully simple question on arrays/hashes by McDarren (Abbot) on Jul 01, 2007 at 05:27 UTC
(Moving this off my pad and into here - where it belongs)

Try running the following code. I think it meets most of your requirements. To be perfectly honest, it's probably not much better than your existing code - but it does demonstrate a different approach, and the use of a HoH which I mentioned in my initial reply. If there are any parts of it that need explaining, feel free to ask :)

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

# Replace the following with the location of your maillog
# (Don't forget to include the full path if it isn't in the cwd)
my $mail_log = 'billie_t.log';

my %messages;
my $valid_src = 'exch';
my $valid_dst = 'gov.au';
my $verbose = 1; # Set this to 1 if you want to see which lines are ignored

open my $mail_fh, '<', $mail_log or die "Cannot open $mail_log:$!\n";

while (my $line = <$mail_fh>) {
    chomp($line);
    if ($line =~ /qmgr.*?from=/) {
        # It's a new mail
        my ($id, $from) = (split /\s+/, $line)[5,6];
        chop($id); # remove the colon
        $from =~ s/^from=<(.*?)>,/$1/;
        $messages{$id}{from} = $from;
    }
    elsif ($line =~ /client=/) {
        my ($id, $client) = (split /\s+/, $line)[5,6];
        chop($id);
        $client =~ s/client=(.*)/$1/;
        $messages{$id}{client} = $client;
    }
    elsif ($line =~ /relay=localhost/) {
        my ($id, $to, $status) = (split /\s+/, $line, 10)[5,6,9];
        chop($id);
        $to =~ s/to=<(.*?)>,/$1/;
        my ($remote_id) = $status =~ m/([0-9A-Z]+)\)$/;
        $status =~ s/.*?\((.*?)\)$/$1/;
        $messages{$id}{to} = $to;
        $messages{$id}{status} = $status;
        $messages{$id}{remote_id} = $remote_id;
    }
    else {
        print "Skipping... $line\n" if $verbose;
    }
}

for my $id (sort keys %messages) {
    no warnings 'uninitialized';
    next if (!$messages{$id}{client} || $messages{$id}{client} !~ /\Q$valid_src\E/);
    next if (!$messages{$id}{to} || $messages{$id}{to} !~ /\Q$valid_dst\E/);
    print qq(Valid message found: ID=$id\n\tFROM=$messages{$id}{from}\n\t),
          qq(TO=$messages{$id}{to}\n\tSTATUS=$messages{$id}{status}\n\t),
          qq(CLIENT=$messages{$id}{client}\n\tREMOTE_ID=$messages{$id}{remote_id}\n\n);
}

# Uncomment the following line to get a better picture of the datastructure that is created
# (You might want to redirect the output to a file)
#print Dumper(\%messages);

close $mail_fh;