Greetings

I'm looking for a bit of help, just like everyone else. My particular need is for blazing speed.

Background:
I work for a fairly large ISP. We have a new whiz-bang mail system. I get to process the log files for the hosts responsible for dealing with the SMTP connections to the outside world, as well as the internal system. We can call them MTAs (Mail Transfer Agents). Each MTA produces 4 log files daily (1 every 6 hours), and each file has roughly 1 million lines. I have managed to leap through flaming hoops and come up with an algorithm to deal with the files and spit out relevant metrics on them, i.e. the number of messages sent by a particular IP address through the MTA in question, and what happened to all of the messages. If a message was received for, let's say, 4 recipients, and it was delivered to 2 recipients locally and forwarded for the other 2, we keep track of it as such and update the IP's info with a count of 4. Rinse, repeat. Then, if an IP passed 500 messages, we print out how many we got from it. If it is over 1000 messages, we also break down how many messages of each type it sent, i.e. for say 1600 messages, how many were dropped as spam, how many were delivered locally, delivered remotely, etc.
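To make the reporting rule concrete, here is a minimal sketch of it on its own, separate from the log parsing. The sub name `report_lines` and the two constants are made up for illustration, and the cut-offs assume that 500 and 1000 are inclusive, which is how the posted script treats them.

```perl
use strict;
use warnings;

# Hypothetical names; the cut-offs follow the description above.
my $REPORT_MIN    = 500;    # below this an IP is not reported at all
my $BREAKDOWN_MIN = 1000;   # at or above this we also print per-type counts

# Returns the report lines for one IP, given its total message count
# and a hash ref of per-type counts (spam, del_loc, del_rem, ...).
sub report_lines {
    my ($ip, $total, $by_type) = @_;
    return () if $total < $REPORT_MIN;

    my @lines = sprintf("%15s : %d", $ip, $total);
    if ($total >= $BREAKDOWN_MIN) {
        push @lines,
            map { sprintf("%15s : %d", $_, $by_type->{$_}) }
            sort keys %$by_type;
    }
    return @lines;
}

# Made-up counts for one IP that sent 1600 messages.
print "$_\n"
    for report_lines('192.168.22.102', 1600,
                     { spam => 900, del_loc => 500, del_rem => 200 });
```

With the made-up counts above, this prints the total for the IP followed by one line per message type.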

So here I am. The code runs relatively fast, except it takes about 200 - 300 seconds to process a 6 hour window. If you extrapolate this out times 4 files per host times 11 hosts within the system, it works out to roughly 2.5 to 4 hours of processing time. There has got to be a better algorithm to deal with this problem space.
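The extrapolation is simple enough to check by hand: 4 files times 11 hosts is 44 files per day, so 200 to 300 seconds per file comes to somewhere between about 2.4 and 3.7 hours. A throwaway sketch of the same arithmetic (the sub name `hours_for` is just for illustration):

```perl
use strict;
use warnings;

my $FILES_PER_DAY = 4 * 11;    # 4 files per host, 11 hosts

# Total processing time in hours for a given per-file time in seconds.
sub hours_for {
    my ($secs_per_file) = @_;
    return $secs_per_file * $FILES_PER_DAY / 3600;
}

printf "%d s/file -> %.1f hours\n", $_, hours_for($_) for 200, 300;
```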

Code is included below. A couple of comments:
First I run through the file, grabbing all the messages and sorting them by message ID (which isn't always unique, but right now I'm ignoring that). This pass takes about 30 seconds.
Then I loop through my arrays of msgids and process each data set. Sometimes the data set is simply 2 lines; other times I have seen it as high as 100 lines.
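The first pass described above amounts to building a hash of arrays keyed on message ID. A stripped-down sketch of just that step, working on a list of lines rather than a file, might look like the following. The sub name `group_by_msgid` is hypothetical, and unlike the full script it scopes the ID to each line, so a line without a msgid is skipped instead of picking up the previous line's ID.

```perl
use strict;
use warnings;

# Group MsgTrace lines by message ID.
# Returns a hash ref of  id => [ lines sharing that id ].
sub group_by_msgid {
    my (@lines) = @_;
    my %by_id;
    for my $line (@lines) {
        next unless index($line, 'MsgTrace') >= 0;
        # Capture the ID without its surrounding angle brackets.
        # A line with no msgid field fails the match and is skipped.
        my ($id) = $line =~ /msgid=<?([^:>]+)>?:/ or next;
        push @{ $by_id{$id} }, $line;
    }
    return \%by_id;
}

# Shortened, made-up lines in the same shape as the real log.
my $groups = group_by_msgid(
    'Note;MsgTrace(65/26) delivered:user=a@example.net:msgid=<ID1@example.com>:size=1',
    'Note;MsgTrace(65/26) received from internet:from=<b@example.com>:msgid=<ID1@example.com>:fromhost=[192.168.0.1]',
);
printf "%s has %d lines\n", $_, scalar @{ $groups->{$_} } for sort keys %$groups;
```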

So the code is below, sorry for such a long post, and thanks for whatever pointers I may receive.

PS: Sorry that I cannot include any actual data. I have mocked some up, but altered the IP addresses to the 192.168.x.x address space and munged other work/customer specific data. I hope this helps some. The IP addresses and email addresses have been altered to protect my job :P. Also, I'm sorry the data set is so ugly in this format.

Data set: --> denotes beginning of data line
Set 1
-->20030123 041132266-0500 mta2 mta 25955 264 244 Note;MsgTrace(65/26) delivered:user=rolad@localdomain.net:mbox=103999368866444:mss=mxmss05:from=<blahblah@hotmail.com>:msgid=<F137MtkQhhzFSeXOfFd000201dc@hotmail.com>:size=3211:port=5007:fromhost=192.168.4.15.137:localAddr=[192.168.50.100]
-->20030123 041132267-0500 mta2 mta 25955 556 261 Note;MsgTrace(65/26) received from internet:from=<blahblah@hotmail.com>:msgid=<F137MtkQhhzFSeXOfFd000201dc@hotmail.com>:fromhost=[192.168.15.137]:localAddr=192.168.50.100:msgfile=/usr/imail/spool/control/158/20030123091131.TKZC25955.mta2.localdomain.net@hotmail.com-Control:msgsize=3167:time=1:sender=<elizabethhugh80@hotmail.com>:rcpts=<rolad@localdomain.net>
Set 2
-->20030123 043157259-0500 mta2 mta 25955 107 368 Note;MsgTrace(65/26) dropped:user=teacat@localdomain.net:mss=mxmss02:from=<fun@somespamdomain.com>:msgid=<20030123093156.UDXV25955.mta2.localdomain.net@outbound4.la.jackpot.com>:size=4252:port=6003:fromhost=192.168.22.102:localAddr=[192.168.50.100]
-->20030123 043157261-0500 mta2 mta 25955 107 616 Note;MsgTrace(65/26) received from internet:from=<fun@somespamdomain.com>:msgid=<20030123093156.UDXV25955.mta2.localdomain.net@outbound4.la.jackpot.com>:fromhost=[192.168.22.102]:localAddr=192.168.50.100:msgfile=/usr/imail/spool/control/352/20030123093156.UDXV25955.mta2.localdomain.net@outbound4.la.jackpot.com-Control:msgsize=4252:time=1:sender=<fun@somespamdomain.com>:rcpts=<teacat@localdomain.net>
Set 3
-->20030123 010337538-0500 mta2 mta 25955 80 419 Note;MsgTrace(65/26) handled by Error-Handler:from=<aul@localdomain.net>:msgid=<20030123060334.NPQE25955.mta2.localdomain.net@mail>:size=42589:desthost=mailin-01.mx.someremotedomain.com (192.168.138.57):fromhost=192.168.162.207:localAddr=[192.168.50.100]:msgfile=/usr/imail/spool/control/482/20030123060334.NPQE25955.mta2.localdomain.net@mail-Control:msgid=<20030123060334.NPQE25955.mta2.localdomain.net@mail>
-->20030123 010338291-0500 mta2 mta 25955 117 267 Note;MsgTrace(65/26) received from internet:from=<aul@localdomain.net>:msgid=<20030123060334.NPQE25955.mta2.localdomain.net@mail>:fromhost=[192.168.162.207]:localAddr=192.168.50.100:msgfile=/usr/imail/spool/control/482/20030123060334.NPQE25955.mta2.localdomain.net@mail-Control:msgsize=42589:time=4:sender=<Saul@localdomain.net>:rcpts=<blackd57@someremotedomain.com>
Set 4
-->20030123 002524982-0500 mta2 mta 25955 363 762 Note;MsgTrace(65/26) delivered:user=rische@localdomain.net:mbox=103999336928419:mss=mxmss05:from=<ari@someotherdomain.org>:msgid=<NEBBLMPMBNMFIKKABMDGOEEPDLAA.ari@someotherdomain.org>:size=2454:port=5007:fromhost=192.168.10.180:localAddr=[192.168.50.100]
-->20030123 002524984-0500 mta2 mta 25955 205 473 Note;MsgTrace(65/26) received from internet:from=<ari@someotherdomain.org>:msgid=<NEBBLMPMBNMFIKKABMDGOEEPDLAA.ari@someotherdomain.org>:fromhost=[192.168.10.180]:localAddr=192.168.50.100:msgfile=/usr/imail/spool/control/151/20030123052523.MHMQ25955.mta2.localdomain.net@anguilla.alumniconnections.com-Control:msgsize=2419:time=1:sender=<ari@someotherdomain.org>:rcpts=<rische@localdomain.net>
Set 5
-->20030123 014536535-0500 mta2 mta 25955 63 717 Note;MsgTrace(65/26) bounced:user=baner3@localdomain.net:mbox=103999314773141:mss=mxmss01:from=<specials@anotherspamdom.com>:msgid=<20030123064532.OZKU25955.mta2.localdomain.net@mailer72.anotherspamdom.com>:size=10791:port=5007:fromhost=192.168.165.72:localAddr=[192.168.50.100]
-->20030123 014536603-0500 mta2 mta 25955 217 408 Note;MsgTrace(65/26) handled by Error-Handler:mss=mxmss01:from=<specials@anotherspamdom.com>:msgid=<20030123064532.OZKU25955.mta2.localdomain.net@mailer72.anotherspamdom.com>:size=10750:port=5007:fromhost=192.168.165.72:localAddr=[192.168.50.100]:msgfile=/usr/imail/spool/control/448/20030123064532.OZKU25955.mta2.localdomain.net@mailer72.anotherspamdom.com-Control:msgid=<20030123064532.OZKU25955.mta2.localdomain.net@mailer72.anotherspamdom.com>
-->20030123 014545048-0500 mta2 mta 25955 423 314 Note;MsgTrace(65/26) received from internet:from=<specials@anotherspamdom.com>:msgid=<20030123064532.OZKU25955.mta2.localdomain.net@mailer72.anotherspamdom.com>:fromhost=[192.168.165.72]:localAddr=192.168.50.100:msgfile=/usr/imail/spool/control/448/20030123064532.OZKU25955.mta2.localdomain.net@mailer72.anotherspamdom.com-Control:msgsize=10750:time=13:sender=<specials@anotherspamdom.com>:rcpts=<bainter3@localdomain.net>
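Judging from the samples above, each record looks like a header ending in a status phrase, followed by colon-separated key=value fields. Assuming that holds for the real logs (it is only a guess from the mocked data), a record can be turned into a hash with one split instead of several regex passes. The sub name `parse_trace` and the `status` key are made up for this sketch.

```perl
use strict;
use warnings;

# Parse one MsgTrace record into a hash ref.
# Assumes colon-separated key=value fields with no colons inside values.
sub parse_trace {
    my ($line) = @_;
    my ($head, @pairs) = split /:/, $line;

    # The status phrase is whatever follows "MsgTrace(n/n)" in the header.
    my ($status) = $head =~ /MsgTrace\(\d+\/\d+\)\s*(.+)$/ or return;

    my %field = (status => $status);
    for my $pair (@pairs) {
        my ($key, $value) = split /=/, $pair, 2;
        next unless defined $value;
        $value =~ s/^[<\[]|[>\]]$//g;    # strip surrounding <> or []
        # Keep the first occurrence when a key (e.g. msgid) repeats.
        $field{$key} = $value unless exists $field{$key};
    }
    return \%field;
}

# Shortened, made-up record in the same shape as the samples.
my $rec = parse_trace(
    '-->20030123 043157259-0500 mta2 mta 25955 107 368 Note;MsgTrace(65/26) '
  . 'dropped:user=teacat@localdomain.net:msgid=<ABC@example.com>'
  . ':fromhost=192.168.22.102:localAddr=[192.168.50.100]'
);
print "$rec->{status} from $rec->{fromhost}\n";
```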

Code:
#!/usr/bin/perl

$base = '/usr/local/stats/data';
$CHARS = '(<|>|\[|\])';
$| = 1;

%states = (
    'bounced'   => 'bounce',
    'deferred'  => 'queued',
    'directory' => 'queued',
    'delivered' => 'del_loc',
    'dropped'   => 'spam',
    'internet'  => 'del_rem',
    'Handler'   => 'err',
    'forwarded' => 'forward',
);

opendir(BASE, "$base") || die "Cant access $base\nReason: $!\n";

foreach $file ( grep(/mta2.*\.log$/, readdir(BASE)) ) {
    chomp($file);
    $real = "$base/$file";
    $start = time;

    print "Processing: $real\n";
    open(IN, "$real") || die "Cant read $real\nReason: $!\n";

    while ( <IN> ) {
        next if (!/MsgTrace/);
        chomp();

        #
        # Grab the msgid field, then take the line and stuff it into the id's array after cleaning the
        # extra chars from the ID
        #
        $id = $1 if (/msgid=([^:]+):/);
        $id =~ s/(^<|>$)//g;

        if (!$id || $id =~ /^(\s+|)$/) {
            $no_id++;
            next;
        }

        push(@{$data{$id}}, $_);
    }
    close(IN);

    print "Finished sorting by msgid in: " . (time - $start) . " seconds\n";

    foreach $id (keys %data) {
        $s_id = time;
        @data = @{$data{$id}};
        $r_count = grep(/received/, @data);

        if (!$r_count) {
            #
            # These are internal messages about queueing and such. We wont keep metrics on this
            #
            next;
        }
        elsif ($r_count >= 2) {
            #
            # For some reason we get multiple messages with the same ID.. Not sure how to deal with them
            # yet :P
            #
            $bad_windows++;
            next;
        }
        else {
            &process_it(@data);
        }

        @data = ();
        undef(%{$data{$id}});
    } # END foreach id keys %data

    print "Processed: $real in: " . (time - $start) . " seconds\n";
}

print "Totals for all data files.\n";
print <<EOF;
Note:
  Due to the methods employed to log the mta data, the breakdown by type of mail will not
  necessarily add up to the total messages sent from an IP. The total is accurate and the breakdown may be short data.

  Also again due to the method data is logged the code could not process < $bad_windows > msg_ids
EOF

foreach $ip ( sort { $by_ips{$a} <=> $by_ips{$b} } keys %by_ips ) {
    next if ($by_ips{$ip} <= 499);

    if ($by_ips{$ip} <= 999) {
        printf("%15s : %s\n", $ip, $by_ips{$ip});
    }
    else {
        printf("\n%15s : %s\n", $ip, $by_ips{$ip});
        %tmp = %{$by_type{$ip}};
        for (keys %tmp) {
            printf("%15s : %s\n", $_, $tmp{$_});
        }
    }
}

sub Print_Line {
    my($char) = shift;

    if ($char) {
        print "$char" x 80 . "\n";
    }
    else {
        print '*' x 80 . "\n";
    }
}

sub process_it {
    my($rec_line,$from,$tmp,$count,$line,$type,$tmp_c,$total_c,$unknown,
       @in,@data,@line);

    @in = @_;
    ($rec_line) = grep(/received from internet:/, @in);
    ($from, $tmp) = ( split(/:/, $rec_line) )[3, -1];
    $from =~ s/(fromhost=|$CHARS)//g;

    #
    # We convert the @ signs to @ signs, and get back how many times it happened in the string
    # in question. This way we know how many people the email went to.
    #
    $count = $tmp =~ s/@/@/g;

    #
    # Grab our data set now, Sometimes we have data sets with a single Error-Handler line,
    # other times we have error-handler, and an actual breakdown of what happened to the message
    # ie deferred, dropped, bounced, etc.. so if its only one line we leave it alone, else we
    # whack the error-handler lines
    #
    (@data) = grep(!/received from internet:/, @in);
    (@data) = grep(!/Error-Handler/, @data) if ( grep(/Error-Handler/, @data) && $data[1] );

    foreach $line (@data) {
        @line = split(/:/, $line);

        if ($line[0] =~ /( |-)([a-zA-Z]+)$/) {
            $type = $states{$2};
            warn "NO type: $line[0]\n" if ($type =~ /^(\s+|)$/);

            if ($type !~ /del_rem/) {
                $tmp_c++;
            }
            else {
                #
                # this is the same as the above, but we are counting how many people we sent to outside of
                # our mail system
                #
                $tmp_c += $line[$#line] =~ s/@/@/g;
            }
        }

        $by_type{$from}{$type} += $tmp_c;
        $total_c += $tmp_c;
        undef($tmp_c);
    }

    $by_ips{$from} += $total_c;
}


/* And the Creator, against his better judgement, wrote man.c */

Edit by tye, put CODE tags around extremely long "words", add READMORE


In reply to The need for speed by l2kashe
