The need for speed

l2kashe has asked for the wisdom of the Perl Monks concerning the following question:

Greetings

Im looking for a bit of help just like everyone else. Mine is the need for blazing speed.

Background:
I work for a fairly large ISP. We have a new whiz bang mail system. I get to process the log files for the hosts responsible for dealing with the SMTP connections to the outside world, as well as the internal system. We can call them MTAs (Mail Transfer Agents). The MTAs each produce 4 log files ( 1 every 6 hours ) daily. They each have roughly 1 million lines each. I have managed to leap through flaming hoops and come up with an algorithm to deal with the files, and to spit out relevant metrics on them. I.e number of messages sent by a particular IP address through the MTA in question, and what happened to all of the messages. If a message was received, for lets say 4 recipients, if it was delivered to 2 recipients locally, and 2 were forwarded we keep track of it as such, and update the IPs info with a count of 4, rinse repeat. Then if they passed over 500 messages, we print out how many we got from them, also if they are over 1000 messages we also break down how many messages of what type they sent, I.e for say 1600 messages how many were dropped as spam, how many were delivered locally, delivered remotely, etc...

So here I am, code runs relatively fast, except its takes about say 200 - 300 seconds to process a 6 hour window. If you extrapolate this out times 4 files per host times 11 hosts within the system it works out to roughly 4 hours of processing time. There has got to be a better algorithm to deal with this problem space.

Code is included below. Couple of comments include:
I run through the file, grabbing all the messages and sorting by its message ID (which isnt always unique, but right now im ignoring that). It takes about 30 seconds to finish this process
Then I loop through my arrays of msgids and process the data set. Sometimes the data set is simply 2 lines, other times I have seen it as high as 100 lines.

So the code is below, sorry for such a long post, and thanks for whatever pointers I may receive.

PS: Sorry that I can not include any actual data. I have mocked some up, but altered the IP addresses to 192.168.x.x address space, and munged other work/customer specific data, I hope this helps some. The IP addresses and email addresses have been altered to protect my job :P. Also Im sorry the data set is so ugly in this format

Data set: --> denotes beginning of data line

set 1
-->20030123 041132266-0500 mta2 mta 25955 264 244 Note;MsgTrace(65/26)
+ delivered:user=rolad@localdomain.net:mbox=103999368866444:mss=mxmss0
+5:from=<blahblah@hotmail.com>:msgid=<F137MtkQhhzFSeXOfFd000201dc@hotm
+ail.com>:size=3211:port=5007:fromhost=192.168.4.15.137:localAddr=[192
+.168.50.100]
-->20030123 041132267-0500 mta2 mta 25955 556 261 Note;MsgTrace(65/26)
+ received from internet:from=<blahblah@hotmail.com>:msgid=<F137MtkQhh
+zFSeXOfFd000201dc@hotmail.com>:fromhost=[192.168.15.137]:localAddr=19
+2.168.50.100:msgfile=/usr/imail/spool/control/158/20030123091131.TKZC
+25955.mta2.localdomain.net@hotmail.com-Control:msgsize=3167:time=1:se
+nder=<elizabethhugh80@hotmail.com>:rcpts=<rolad@localdomain.net>
[download]

Set 2

-->20030123 043157259-0500 mta2 mta 25955 107 368 Note;MsgTrace(65/26)
+ dropped:user=teacat@localdomain.net:mss=mxmss02:from=<fun@somespamdo
+main.com>:msgid=<20030123093156.UDXV25955.mta2.localdomain.net@outbou
+nd4.la.jackpot.com>:size=4252:port=6003:fromhost=192.168.22.102:local
+Addr=[192.168.50.100]
-->20030123 043157261-0500 mta2 mta 25955 107 616 Note;MsgTrace(65/26)
+ received from internet:from=<fun@somespamdomain.com>:msgid=<20030123
+093156.UDXV25955.mta2.localdomain.net@outbound4.la.jackpot.com>:fromh
+ost=[192.168.22.102]:localAddr=192.168.50.100:msgfile=/usr/imail/spoo
+l/control/352/20030123093156.UDXV25955.mta2.localdomain.net@outbound4
+.la.jackpot.com-Control:msgsize=4252:time=1:sender=<fun@somespamdomai
+n.com>:rcpts=<teacat@localdomain.net>
[download]

Set 3

-->20030123 010337538-0500 mta2 mta 25955 80 419 Note;MsgTrace(65/26) 
+handled by Error-Handler:from=<aul@localdomain.net>:msgid=<2003012306
+0334.NPQE25955.mta2.localdomain.net@mail>:size=42589:desthost=mailin-
+01.mx.someremotedomain.com (192.168.138.57):fromhost=192.168.162.207:
+localAddr=[192.168.50.100]:msgfile=/usr/imail/spool/control/482/20030
+123060334.NPQE25955.mta2.localdomain.net@mail-Control:msgid=<20030123
+060334.NPQE25955.mta2.localdomain.net@mail>
-->20030123 010338291-0500 mta2 mta 25955 117 267 Note;MsgTrace(65/26)
+ received from internet:from=<aul@localdomain.net>:msgid=<20030123060
+334.NPQE25955.mta2.localdomain.net@mail>:fromhost=[192.168.162.207]:l
+ocalAddr=192.168.50.100:msgfile=/usr/imail/spool/control/482/20030123
+060334.NPQE25955.mta2.localdomain.net@mail-Control:msgsize=42589:time
+=4:sender=<Saul@localdomain.net>:rcpts=<blackd57@someremotedomain.com
+>
[download]

Set 4

-->20030123 002524982-0500 mta2 mta 25955 363 762 Note;MsgTrace(65/26)
+ delivered:user=rische@localdomain.net:mbox=103999336928419:mss=mxmss
+05:from=<ari@someotherdomain.org>:msgid=<NEBBLMPMBNMFIKKABMDGOEEPDLAA
+.ari@someotherdomain.org>:size=2454:port=5007:fromhost=192.168.10.180
+:localAddr=[192.168.50.100]
-->20030123 002524984-0500 mta2 mta 25955 205 473 Note;MsgTrace(65/26)
+ received from internet:from=<ari@someotherdomain.org>:msgid=<NEBBLMP
+MBNMFIKKABMDGOEEPDLAA.ari@someotherdomain.org>:fromhost=[192.168.10.1
+80]:localAddr=192.168.50.100:msgfile=/usr/imail/spool/control/151/200
+30123052523.MHMQ25955.mta2.localdomain.net@anguilla.alumniconnections
+.com-Control:msgsize=2419:time=1:sender=<ari@someotherdomain.org>:rcp
+ts=<rische@localdomain.net>
[download]

Set 5

-->20030123 014536535-0500 mta2 mta 25955 63 717 Note;MsgTrace(65/26) 
+bounced:user=baner3@localdomain.net:mbox=103999314773141:mss=mxmss01:
+from=<specials@anotherspamdom.com>:msgid=<20030123064532.OZKU25955.mt
+a2.localdomain.net@mailer72.anotherspamdom.com>:size=10791:port=5007:
+fromhost=192.168.165.72:localAddr=[192.168.50.100]
-->20030123 014536603-0500 mta2 mta 25955 217 408 Note;MsgTrace(65/26)
+ handled by Error-Handler:mss=mxmss01:from=<specials@anotherspamdom.c
+om>:msgid=<20030123064532.OZKU25955.mta2.localdomain.net@mailer72.ano
+therspamdom.com>:size=10750:port=5007:fromhost=192.168.165.72:localAd
+dr=[192.168.50.100]:msgfile=/usr/imail/spool/control/448/200301230645
+32.OZKU25955.mta2.localdomain.net@mailer72.anotherspamdom.com-Control
+:msgid=<20030123064532.OZKU25955.mta2.localdomain.net@mailer72.anothe
+rspamdom.com>
-->20030123 014545048-0500 mta2 mta 25955 423 314 Note;MsgTrace(65/26)
+ received from internet:from=<specials@anotherspamdom.com>:msgid=<200
+30123064532.OZKU25955.mta2.localdomain.net@mailer72.anotherspamdom.co
+m>:fromhost=[192.168.165.72]:localAddr=192.168.50.100:msgfile=/usr/im
+ail/spool/control/448/20030123064532.OZKU25955.mta2.localdomain.net@m
+ailer72.anotherspamdom.com-Control:msgsize=10750:time=13:sender=<spec
+ials@anotherspamdom.com>:rcpts=<bainter3@localdomain.net>
[download]

#!/usr/bin/perl

$base  = '/usr/local/stats/data';
$CHARS = '(<|>|\[|\])';

$| = 1;

%states = (
   'bounced'   => 'bounce',
   'deferred'  => 'queued',
   'directory' => 'queued',
   'delivered' => 'del_loc',
   'dropped'   => 'spam',
   'internet'  => 'del_rem',
   'Handler'   => 'err',
   'forwarded' => 'forward',
);

opendir(BASE, "$base") || die "Cant access $base\nReason: $!\n";

foreach $file ( grep(/mta2.*\.log$/, readdir(BASE)) ) {
   chomp($file);

   $real  = "$base/$file";
   $start = time;

   print "Processing: $real\n";
   open(IN, "$real") || die "Cant read $real\nReason: $!\n";

   while ( <IN> ) {
      next if (!/MsgTrace/);
      chomp();
#
# Grab the msgid field, then take the line and stuff it into the id's 
+array after cleaning the
# extra chars from the ID
#
      $id = $1 if (/msgid=([^:]+):/);

      $id =~ s/(^<|>$)//g;

      if (!$id || $id =~ /^(\s+|)$/) {
         $no_id++;
         next;
      }

      push(@{$data{$id}}, $_);
   }
   close(IN);

   print "Finished sorting by msgid in: " . (time - $start) . " second
+s\n";

   foreach $id (keys %data) {
      $s_id    = time;
      @data    = @{$data{$id}};
      $r_count = grep(/received/, @data);

      if (!$r_count) {
#
# These are internal messages about queueing and such. We wont keep me
+trics on this
#
         next;

      } elsif ($r_count >= 2) {
# 
# For some reason we get multiple messages with the same ID.. Not sure
+ how to deal with them
# yet :P
#
         $bad_windows++;
         next;

      } else {
         &process_it(@data);
      }

      @data = ();
      undef(%{$data{$id}});

   } # END foreach id keys %data

   print "Processed: $real in: " . (time - $start) . " seconds\n";

}

print "Totals for all data files.\n";
print <<EOF;
Note: 
   Due to the methods employed to log the mta data, the breakdown by t
+ype of mail will not 
   necessarily add up to the total messages sent from an IP. The total
+ is accurate and the
   breakdown may be short data. 

   Also again due to the method data is logged the code could not proc
+ess < $bad_windows > msg_ids

EOF

foreach $ip ( sort { $by_ips{$a} <=> $by_ips{$b} } keys %by_ips ) {
   next if ($by_ips{$ip} <= 499);

   if ($by_ips{$ip} <= 999) {
      printf("%15s : %s\n", $ip, $by_ips{$ip});

   } else {
      printf("\n%15s : %s\n", $ip, $by_ips{$ip});

      %tmp = %{$by_type{$ip}};

      for (keys %tmp) {
         printf("%15s : %s\n", $_, $tmp{$_});
      }
   }
}

sub Print_Line {
   my($char) = shift;
   if ($char) {
      print "$char" x 80 . "\n";
   } else {
      print '*' x 80 . "\n";
   }
}

sub process_it {
   my($rec_line,$from,$tmp,$count,$line,$type,$tmp_c,$total_c,$unknown
+, @in,@data,@line);

   @in           = @_;
   ($rec_line)   = grep(/received from internet:/, @in);
   ($from, $tmp) = ( split(/:/, $rec_line) )[3, -1];

   $from =~ s/(fromhost=|$CHARS)//g;
# 
# We convert the @ signs to @ signs, and get back how many times it ha
+ppened in the string
# in question. This way we know how many people the email went to.
#
   $count = $tmp =~ s/@/@/g;
#
# Grab our data set now, Sometimes we have data sets with a single Err
+or-Handler line,
# other times we have error-handler, and an actual breakdown of what h
+appened to the message
# ie deferred, dropped, bounced, etc.. so if its only one line we leav
+e it alone, else we 
# whack the error-handler lines
# 
   (@data) = grep(!/received from internet:/, @in);
   (@data) = grep(!/Error-Handler/, @data) if ( grep(/Error-Handler/, 
+@data) && $data[1] );

   foreach $line (@data) {
      @line = split(/:/, $line);

      if ($line[0] =~ /( |-)([a-zA-Z]+)$/) {
         $type = $states{$2};
         warn "NO type: $line[0]\n" if ($type =~ /^(\s+|)$/);

         if ($type !~ /del_rem/) {
            $tmp_c++;

         } else {
# 
# this is the same as the above, but we are counting how many people w
+e sent to outside of 
# our mail system
#
            $tmp_c += $line[$#line] =~ s/@/@/g;
         }
      }

      $by_type{$from}{$type} += $tmp_c;
      $total_c               += $tmp_c;
      undef($tmp_c);
   }

   $by_ips{$from}           += $total_c;
}
[download]

/* And the Creator, against his better judgement, wrote man.c */

Edit by tye, put CODE tags around extremely long "words", add READMORE

Comment on The need for speed Select or Download Code

Replies are listed 'Best First'.
Re: The need for speed by BrowserUk (Patriarch) on Jan 24, 2003 at 00:30 UTC
Just looking at your `sub process _it` `my($rec_line,$from,$tmp,$count,$line,$type,$tmp_c,$total_c,$unknown, @in,@data,@line);` It's nice to see you using my. but it would be so much more effective if you used it at the point/scope at which the variable is used. It would save you having to do stuff like `undef($tmp_c);` `$unknown` is an appropriate name :). It appears nowhere else in the entire file. `@in = @_;` You make a local copy of the @_ here, then a few lines further on, you reduce @in using grep into @data and never use @in again. Better to reduce memory and cycles by supplying @_ directly to the grep. `$count = $tmp =~ s/@/@/g;` A quick test shows that `$count = $tmp =~ tr/@/@/;` is about 4 times faster at counting occurances of a single char. `(@data) = grep(!/Error-Handler/, @data) if ( grep(/Error-Handler/, @data) && $data[1] );` This is .. er,.. strange. If the idea is that `...&& $data[1] )` will mean the greping will only happen if the array has more than one line, too late! You already grep'd the array in the first part of the conditional. In fact, you will only get to check if there is more than one line in the array if the grep has already discovered a line containing /Error-Handler/, which presumably means the was more than one line? Further, if the reason for the conditional is to prevent wasted cycles by greping to remove these lines unless they exists, again too late. You already grep'd to check. Ultimately, if there are no /Error-Handler/ lines, executing the code to remove them will have no effect., but at least you will only have grep'd once not twice. `foreach $line (@data) { @line = split(/:/, $line);` [download] You take a line from an array called @data, and call it $line. split $line into an array called @line which then is referred to as $linen in the rest of the loop. Nothing wrong exactly, but mighty confusing. Reminds me of the MP sketch about the Bruce's ;). `warn "NO type: $line[0]\n" if ($type =~ /^(\s+\|)$/);` Why `/^(\s+\|)$/)` instead of `/^\s$/`? I vaguely remember something about (:\s+\|) as an optimisation, but the fact that you were forcing the spaces or nothing to be captured and then discarded must surely be costing you more than you were saving? That said, the value you are checking, $type, comes from the previous line `$type = $states{$2};` . Now if $2 is not a key of `%states,` then that will return `undef` so that line would me much better written as `warn "NO type: $line[0]\n" if not defined $type;` or even `warn "..." unless $type;<code> <p>which are both more efficient and clearer. <p>However, you then go on to use $type as a hash key in <p><code> $by_type{$from}{$type} += $tmp_c;` [download] Which if you had `-w` or `use warnings;` in your code, would result in an runtime warning of something like `Use of uninitialized value in hash element at line nnn` and would remove the need for issuing your own. It is also possibly a more acurate reason for the note in your code: `print <<EOF; Note: Due to the methods employed to log the mta data, the breakdown by typ +e of mail will not necessarily add up to the total messages sent from an IP. The total i +s accurate and the breakdown may be short data. EOF` [download] I believe that the sub could be reduced to somthing like the following at least. This is completely untested, and changed by (quick) inspection only , but I think it should be close and run quite a bit quicker. As it is called many times at the heart of your application, it might make a substantial difference to your run times. I'd like to here the results. I think you could also consolidate at least 1 if not both of the temporary count vars `$tmp_c & $total_c`, but I couldn't wrap my head around the logic. sub process_it { my ($rec_line) = grep(/received from internet:/, @_); my ($from, $tmp) = ( split(/:/, $rec_line) )[3, -1]; $from =~ s/(fromhost=\|$CHARS)//g; my $count = $tmp =~ s/@/@/g; my @data = grep(!/received from internet:/ and !/Error-Handler/, @ +_); my $total_c; foreach my $line (@data) { my @line = split(/:/, $line); my ($tmp_c, $type); if ($line[0] =~ /( \|-)([a-zA-Z]+)$/) { $type = $states{$2}; warn "NO type: $line[0]\n" if ($type =~ /^(\s+\|)$/); $tmp_c += ($type =~ /del_rem/) ? $line[$#line] =~ tr/@/@/ +: 1; } $by_type{$from}{$type} += $tmp_c; $total_c += $tmp_c; } $by_ips{$from}+= $total_c; } [download] A quick scan of the rest of the code suggests that some of these changes could be beneficial there too, but that's left as AEFTP. Hope it helps some. Examine what is said, not who speaks. The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.*	[reply] [d/l] [select]
Re: The need for speed by sauoq (Abbot) on Jan 23, 2003 at 23:33 UTC
There are a couple things you can do to speed up your code a bit. I don't know if they'll give you the kind of improvements you are looking for though. You didn't mention what hardware you got those runtimes on, by the way. For heavy log processing, it often makes sense to use a dedicated host or hosts and this might be a problem worth throwing some hardware at as you need it in the future. Analyzing 6 hours worth of data in 4 hours still allows you to finish well ahead of starting the next batch so why isn't it quick enough? What are your requirements? Following are some optimizations. The first thing I see is that you use s/// to count characters. You should use tr/// for that instead. `my $count = $string =~ tr/@//;` [download] I don't know how much it will help in your case but using alternation when a character class will do is generally frowned upon. Also, you might as well use precompiled expressions when you can. I'd change `$CHARS = '(<\|>\|\[\|\])';` [download] to `$CHARS = qr/([][<>])/;` [download] In your first loop, why do you perform a substitution on $id after matching it with a regular expression in the first place? I would replace `$id = $1 if (/msgid=([^:]+):/); $id =~ s/(^<\|>$)//g;` [download] with `($id) = /msgid=<([^>]+)/;` [download] It isn't exactly the same but I think it would work based on your sample data. If the msgid isn't always bracketed by '<' and '>' then it will break. It will also break if a '>' is permitted in a msgid. One way or the other though, there is bound to be a way to express what you need in one regex. This may be more of a style issue, but in that first loop, you also have `if (!$id \|\| $id =~ /^(\s+\|)$/) { $no_id++; next; }` [download] Why not just say `$no_id++ and next unless $id =~ /\S/;`? For that matter, you start it with `next if (!/MsgTrace/);` which would be better expressed with `next unless /MsgTrace/;` because it avoid the double negative. In your next loop, you make a copy of your data each time. That's a waste especially since you go to lengths not to modify your copy and you throw both the copy and the original out when you are done. (In fact, you probably shouldn't go to the trouble throwing the original out, by the way. You aren't doing what you think anyway.) Just work with the original. That's bound to save you some time. Also, consider passing a reference to process_it() rather than the array itself. Also, in that function, each of those greps is a loop. Try to combine them. Instead of `(@data) = grep(!/received from internet:/, @in); (@data) = grep(!/Error-Handler/, @data) if ( grep(/Error-Handler/, @data) && $data[1] );` [download] try something like (untested): `if ($data[1]) { # Do you really mean that, by the way? my (@t1,@t2); @t1 = map { if (/Error-Handler/) { $_ } else { !/received from internet:/ and push @t2, $_; (); } } @data; @data = @t2 if @t1; }` [download] There's probably a better way to do that but that's one way you might reduce those three loops (greps) to one (map). Another micro-optimisation: Don't capture things you don't use in regexen: `if ($line[0] =~ /( \|-)([a-zA-Z]+)$/) {` [download] would be better written as `if ($line[0] =~ /[ -](a-zA-Z]+)$/) {` [download] instead. I imagine you can clean it up in other ways too. The most dramatic changes to performance usually come from algorithmic changes and I don't know enough about your problem to really think about a better way to do it. Eliminating loops within other loops is one rule of thumb that should help you a great deal as long as you remember that greps and maps are loops too. -sauoq "My two cents aren't worth a dime.";	[reply] [d/l] [select]
Re^2: The need for speed by Aristotle (Chancellor) on Jan 23, 2003 at 23:50 UTC
`($id) = /msgid=<([^>]+)/;` [download] In this particular case I think the semantics are duplicated exactly by `($id) = /msgid=<?(.+?)>?:/;` [download] Though it is likely he really wants `($id) = /msgid=<(.+?)>:/;` [download] Makeshifts last the longest.	[reply] [d/l] [select]
Re: Re: The need for speed by l2kashe (Deacon) on Jan 24, 2003 at 05:56 UTC
Thanks to all about the tr/// as opposed to s///, I hadn't come up with decent way to deal with how many recipients a message had, and I remembered a post here that used =~ s/// to count I think it was dots. So I used it cause it fit On the $id thing. yes the appropriate regex is `$id = $1 if (/msgid=<[^>]+>/);` [download] as the chars <> will never be valid chars in the ID. The reason for the multiple greps. I have a data set, lets say in instance 1 its 3 lines, in instance 2 is 2 lines. In instance 1 we have 1 received mail line 1 Error-Handler line 1 bounced line In instance 2 we have 1 received line 1 Error-Handler line What I am attempting to deal with is those disparate lines. So I take the received line from the data set. Then if there is more than 1 item left in the data set, and Error-Handler lines are also in the data set I remove them. If there aren't then I leave it alone cause I need to count that the message came through, it was just processed outside the scope of the log file itself. On the whole conditional against $id, it really is just style. If I am only doing 1 thing based on truth, I inline it, if not I use the braces. Thanks for the pointer about not capturing if Im not using it. Makes sense And as for your map usage, I am still a map newbie. Though I will definately see what I can get out of it in terms of mileage. Thanks for the input. oh yeah.. in terms of tossing the data away. If I dont I run out of memory. I need to only process one file, extract the relevant data, then I need to clean my %data out or I simply dont have any memory left. As you can see from the data samples the lines are long. It amazing how your paradigm shifts along with the size of your data set :P /* And the Creator, against his better judgement, wrote man.c */	[reply] [d/l]
Re: Re: Re: The need for speed by sauoq (Abbot) on Jan 24, 2003 at 07:57 UTC
The reason for the multiple greps. The reason is irrelevant. The point is only that you are looping through data multiple times. If you want it to be fast, do your work on one pass through the data if it is at all possible. In your case, it is possible. You could even roll at least some of the work into your following foreach loop. In your process_it() sub you loop through the data a minimum of four(!!!) times. Five sometimes. Lots of loops and speed don't mix. On the whole conditional against $id, it really is just style. If I am only doing 1 thing based on truth, I inline it, if not I use the braces. Well, some of it is a matter of style. I threw you off by using unless in that manner. The code `if (!$id \|\| $id =~ /^(\s+\|)$/)` [download] isn't particularly good for a couple reasons. First, the `!$id` is not expressing what you are trying to say. Granted, you probably aren't going to have a message ID which evaluates to '0' but if you did... oops. Also, it probably isn't optimizing anything. If you have spaces, you have to check both clauses. Using `if (/\S/)` is much clearer and might well be a little more efficient in the long run too. oh yeah.. in terms of tossing the data away. If I dont I run out of memory. I need to only process one file, extract the relevant data, then I need to clean my %data out or I simply dont have any memory left. Declare `%data` with `my` just inside your `foreach $file` loop. Don't undef it one key at a time. It amazing how your paradigm shifts along with the size of your data set :P Oh, I don't know... I've dealt with biggish datasets in the tens and hundreds of gigs. It's true there are some unique logistical considerations and some shortcuts you can't take but the basics of writing efficient code stay the same. The real "paradigm shift" comes when you have to break your problem down such that it can be distributed to multiple machines. -sauoq "My two cents aren't worth a dime.";	[reply] [d/l] [select]
Re: The need for speed by Elian (Parson) on Jan 23, 2003 at 22:20 UTC
Given what you need to do, I think you may find you're better off separating the log parsing and data analysis phases. Have one program who's sole function is to parse the log files and dump the results into a database, and a second to process the data in that database. (You could alter the system to log directly to a DB rather than a file, but I wouldn't recommend that for reliability reasons) Doing this has the following advantages: You can offload the processing to a separate machine or machines You have access to a wider range of data without having to parse the log files multiple times You can write a variety of reports much more easily You can separate the DB server and report machines as well if you're finding you have bottlenecks there. It'll make your life a lot easier in the long run, and give you access to more sophisticated data analysis tools. Setting up MySQL or Postgres isn't that big a deal--I'd really recommend you give it a shot this way.	[reply]
Re^2: The need for speed by Aristotle (Chancellor) on Jan 23, 2003 at 22:47 UTC
And you don't even need to set up Postgres, MySQL or any other real database server - DBD::SQLite is an excellent self contained driver and database, and it's very fast too. See its Perl advent calendar entry for a quick intro. People have reported great success at using it for their log mangling needs. Makeshifts last the longest.	[reply]
Re: The need for speed by talexb (Chancellor) on Jan 23, 2003 at 22:05 UTC
Let me deflect discussion on changing your code and ask the following instead: How do you know that your machine is running close to 100%? Could you get a database to take care of storing, sorting and selecting this data? You might also be able to speed things up by going to a faster machine, one with more RAM or even by spreading the jobs over multiple machines. --t. alex Life is short: get busy!	[reply]
Re: Re: The need for speed by l2kashe (Deacon) on Jan 23, 2003 at 22:21 UTC
My machine is not maxed out. It is a dual proc system, with its filesystem being a hardware raid 0, and plenty of RAM. Theoretically I could use a DB, but then that raises the scenario of, read the data, sort the data into relevant chunks, now do I A) send the data to a DB of some sort, to later reopen, reread into memory what ever way, and then duly process it, or B) simply process it now. Also if I were to segregate the computation of the data across hosts, I'm still dealing with finding a sane way of splitting the data, sending it out to some other host, polling to see when they are done, or waiting for them to finish processing, and then pulling all the data back together and correlating it. All of which adds to the run time. I have also realized that not all the events for a msgid will occur within a given logfile, due to rotation, but im not even gonna go into maintaining state across files, for a very small percentage of the actual data. As you can see in the code I'v been allowed to kinda punt in regards to absolute accuracy, even though it may drive my inner anal-retentive geek crazy, such is life I guess I should step back and say, I'm not attacking your points, so much as stating I've already considered it and didn't think the benefits out weighed the returns in terms of code logic and run time. No offense intended in any way shape or form. I should also add that the system doing the processing itself is not one of the MTAs. It is a completly seperate host, which is doing just about absolutely nothing aside from sshd. Its also a FreeBSD box if it makes any difference /* And the Creator, against his better judgement, wrote man.c */	[reply]
Re: Re: Re: The need for speed by Elian (Parson) on Jan 23, 2003 at 22:31 UTC
Having done processing of flat files in the way you have to, I can tell you that it's far, far easier, and often faster, to use a DB. Throughput is normally better, and having the data queryable in an easy way gives you the opportunity to analyze it in ways you may not have thought of, or though of but discarded because it was infeasable because of processing concerns.	[reply]
Re (3): The need for speed by talexb (Chancellor) on Jan 23, 2003 at 22:35 UTC
OK, well I'd suggest you try running all of the processes in paralell -- to get an idea of how much 'spare time' you have. You know, process data in one process while some of the others are reading through the data files. Then maybe try running just three at a time, and so forth. --t. alex Life is short: get busy!	[reply]
Re: The need for speed by grinder (Bishop) on Jan 23, 2003 at 23:50 UTC
First off all, kudos to you for providing some data and especially having taken the time to sanitise it. ++ for that alone. So, you're sorting 1 000 000 records in 30 seconds? That isn't too shabby, you know? Especially given their length. Depending on how much RAM your machine has, you may be paging out to disk. As I mentioned in a response to a similar question at Fast/Efficient Sort for Large Files, for that kind of volume you want to evaluate replacing Perl's sort by a call to the external sort command. Sorting textual IP addresses is non-trivial: you'll have to write a preprecessor that takes the log, isolates the IP address, packs it with `inet_aton` and prepends it to the line and writes it out. You can then sort the file with no special command line switches. Consider it a meta-Guttman-Rosler Transform if you will. You've switched off file buffering with `$\| = 1`. You ought to remove that line and see if it makes a difference. Other than that, there's nothing really glaring that you seem to be doing wrong. Not too sure about your choice of variable names (e.g. `$tmp_c`), it's hard to guess its purpose. I have a hard time fathoming the purpose of all those grep chains. I'm not sure if the records lend themselves to the following approach, but if a record can only be counted in one single way, you could write a splitter filter that takes the one input file, opens up as many output files as there are categories, and writes the record to the correct category file. Then you write a series of filters to deal with the separate files. There are a number of advantages to this approach. All the consistency checking goes in the splitter. The downstream filters need less error checking in them as they're dealing with a restricted range of records. If you find a bug in one category, you only have to fix its reporting script and rerun it, rather than running the whole batch. And smaller files put less of a strain on the system, you might pick up a few seconds here and there. Categories might be number of recipients / inbound / outbound / garbage. Note that depending on which dimensions you choose, a record could be written to more than one category file. E.g. a message sent to both internal and external recipients. When you split out to different files, you should strip out all extraneous data you don't want to play with. This means that in the reporting scripts you won't have as large a dataset flowing across your bus. Less I/O will improve your score. print@_{sort keys %_},$/if%_=split//,'= & *a?b:e\f/h^h!j+n,o@o;r$s-t%t#u'	[reply] [d/l] [select]
Re: The need for speed by traveler (Parson) on Jan 23, 2003 at 22:07 UTC
Abesnt a lot of real data it is hard to test. I do, however have a few suggestions: Try to profile the code, e.g. using Devel::DProf to see where the time is used. Consider the loop: `foreach $ip ( sort { $by_ips{$a} <=> $by_ips{$b} } keys %by_ips ) { ... }` [download] You might save time dividing it up into the three lists of <499, <=599, >599 first if the list is large and the sort slow. `(@data) = grep(!/received from internet:/, @in); (@data) = grep(!/Error-Handler/, @data) if ( grep(/Error-Handler/, @da +ta) && +$data[1] );` [download] The second assignment appears to override the first... HTH, --traveler	[reply] [d/l] [select]
Re: Re: The need for speed by l2kashe (Deacon) on Jan 23, 2003 at 22:48 UTC
*1. Try to profile the code, e.g. using Devel::DProf to see where the time is used.* This is sound advice.. I actually made use of extensive prints and such, correlating how much time was spent where. Actually reading the data in and then sorting it into data sets like I showed above requires about 30 seconds per million lines. From there, each array (anywhere from 2 elements to as many as 100 elements) is processed. Each array itself that I've processed hasnt taken longer than 1 second, even for those larger arrays. Its just the sheer number of arrays to be dealt with. so out of say a runtime of 275 seconds, 240 seconds is spent processing the actual arrays and printing it. *2. Consider the loop:* `foreach $ip ( sort { $by_ips{$a} <=> $by_ips{$b} } keys %by_ips ) { ... }` [download] *You might save time dividing it up into the three lists of <499, <=599, >599 first if the list is large and the sort slow.* You right, I guess I could, and then only deal with a certain space. But the issue isnt just one log. That loop is the end of the run. It is the culmination of 4 logs / server times 11 servers. So if lets say joe at some ip logs 5 messages on mta1, 20 messages on mta2, 450 message on mta6 and 1200 messages on mta9 (say due to load balancing, then joe's total messages is 1675 messages. But during processing each of those values are in seperate scalars, and then get merged at the end of the sub routine into the global hashs. So when does the segregation happen? Would it be better to segregate data per host into its own hash? But then I need to merge all that data in order to determine if it matches the <= 499, in which case it doesnt get printed at all. If its 500 - 999 then it only shows IP and # of messages, <= 1000 we now need to breakdown what happened to all that mail. I guess I could test along the way to see if a value has exceeded X and then move it to a new data structure, and if its exceeded Y then move to yet another structure. Then loop struct X to simply print, and struct Y to print and break down.. That is a thought.. hrm.. I don't know if its faster to loop 2 smaller structures than 1 larger ones, but its a thought for sure `(@data) = grep(!/received from internet:/, @in); (@data) = grep(!/Error-Handler/, @data) if ( grep(/Error-Handler/, @da +ta) && +$data[1] );` [download] *The second assignment appears to override the first...* It does. I run into situations where I will have a server msg stating a SMTP connection was opened and a message was received. Due to log rotations, and other misc "features" of the MTA code itself, I may get something like data set 3 which has a received and an Error-Handler line within it. Or I may get something like data set 5 which has a received, an Error-Handler, and a bounced line (By types of lines I mean Note;MsgTrace(num/num) blah: with the blah being the relevant data). Now the error-handler and bounced lines are talking about the same thing, but I cant determine what should be in a window prior to. I managed to figure out though that if data elem 1 exists, and if there are Error-Handler entries within the data, then they are superflous, I just couldnt think of a more elegant means to ignore them. I guess I could set some flag var, and test for it, and if it exists then to ignore Error lines. But it again raises my question of speed. Which is faster, testing for a second elem, and whacking Error lines from data set, or testing of elem, setting flag, and then testing that within a tighter loop. Which is better/faster? Thanks for the feedback.. Ill think about the seperate structures on my ride home. :) /* And the Creator, against his better judgement, wrote man.c */	[reply] [d/l] [select]
Re: The need for speed by l2kashe (Deacon) on Jan 24, 2003 at 16:47 UTC
Just wanted to take a second to say thanks. I mean the reason the seeker section is here to to help and be helped, but I always find perlmonks to be a breath of fresh air in terms on the online community. With the help above I managed to get my sort of data into relvant sets down from around 35-40 seconds to around 18-22 seconds. Which isnt too shabby for about 1 million lines per file Also I managed to go from >= 240 seconds total processing per file to <= 200 seconds per file. Which isnt huge, but it adds up, and some files were taking over 400 seconds to process so there it really helped. Just wanted to say thanks :) If anyone expresses intrest I can always slap the new code up on my scratchpad, just msg me. /* And the Creator, against his better judgement, wrote man.c */	[reply]
Re: The need for speed by rir (Vicar) on Jan 24, 2003 at 01:29 UTC
A couple of small optimizations, the real issues seem to be covered already. Any busy fixed size hashes like `%states` may be converted to arrays: `use constant bounced => 0; use constant deferred => 1; use constant directory => 2; # et cetera my @states;` [download] Regexes like `next if (!/MsgTrace/);` should be faster if anchored, I'm just guessing at the values here `next if ( !/^.{52,55}MsgTrace/);`	[reply] [d/l] [select]
Re: Re: The need for speed by cees (Curate) on Jan 25, 2003 at 05:35 UTC
Since the regex is just looking for a fixed string, wouldn't it be faster to just use `index`? `next if index($_, 'MsgTrace') == -1;` [download] I like your anchoring idea though. I'll have to remember that one.	[reply] [d/l] [select]