1. Try to profile the code, e.g. using Devel::DProf to see where the time is used.
This is sound advice. I actually made use of extensive prints and timing, correlating how much time was spent where. Reading the data in and then sorting it into data sets like I showed above takes about 30 seconds per million lines.
From there, each array (anywhere from 2 elements to as many as 100) is processed. No single array has taken longer than 1 second, even the larger ones; it's just the sheer number of arrays to be dealt with. So out of a runtime of, say, 275 seconds, 240 seconds is spent processing the actual arrays and printing the results.
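(For anyone reading along: a minimal Devel::DProf run, as I understand it, is just the two commands below. parse_logs.pl and mta1.log stand in for my actual script and log names.)

    perl -d:DProf parse_logs.pl mta1.log   # run under the profiler; writes tmon.out in the cwd
    dprofpp tmon.out                       # summarize, subs sorted by time spent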
2. Consider the loop:
foreach $ip ( sort { $by_ips{$a} <=> $by_ips{$b} } keys %by_ips ) {
...
}
You might save time dividing it up into the three lists (<=499, 500-999, >=1000) first if the list is large and the sort slow.
You're right, I guess I could, and then only deal with a certain slice. But the issue isn't just one log. That loop is the end of the run: the culmination of 4 logs per server times 11 servers. So if, say, joe at some IP logs 5 messages on mta1, 20 on mta2, 450 on mta6, and 1200 on mta9 (due to load balancing), then joe's total is 1675 messages. But during processing each of those values sits in a separate scalar, and they only get merged into the global hashes at the end of the subroutine.

So when does the segregation happen? Would it be better to segregate the data per host into its own hash? But then I need to merge all that data anyway to determine whether it falls in the <= 499 range, in which case it doesn't get printed at all. If it's 500-999 it only shows the IP and the number of messages, and if it's >= 1000 we need to break down what happened to all that mail.

I guess I could test along the way to see if a value has exceeded X and then move it to a new data structure, and if it's exceeded Y move it to yet another structure. Then loop over struct X to simply print, and over struct Y to print and break down. That is a thought, hrm. I don't know whether it's faster to loop over 2 smaller structures than 1 larger one, but it's a thought for sure, as sketched below.
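Thinking out loud, the test-along-the-way version might look roughly like this (the hash names %report_only and %breakdown are made up; it assumes the merged totals end up in %by_ips as in the loop above):

    # One pass over the merged totals: bucket each IP as it's merged,
    # so the final report loops only touch what actually gets printed.
    my (%report_only, %breakdown);
    while ( my ($ip, $count) = each %by_ips ) {
        next if $count <= 499;            # never printed, skip entirely
        if ( $count <= 999 ) {
            $report_only{$ip} = $count;   # print IP and message count only
        }
        else {
            $breakdown{$ip} = $count;     # print count plus the full breakdown
        }
    }

    # Each sort now runs over a much smaller hash
    for my $ip ( sort { $report_only{$a} <=> $report_only{$b} } keys %report_only ) {
        print "$ip: $report_only{$ip}\n";
    }

The win would be that the <= 499 entries, presumably the bulk of them, never enter any sort at all.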
(@data) = grep(!/received from internet:/, @in);
(@data) = grep(!/Error-Handler/, @data) if ( grep(/Error-Handler/, @data) && $data[1] );
The second assignment appears to override the first...
It does. I run into situations where I will have a server message stating an SMTP connection was opened and a message was received. Due to log rotations and other misc "features" of the MTA code itself, I may get something like data set 3, which has a received and an Error-Handler line within it. Or I may get something like data set 5, which has a received, an Error-Handler, and a bounced line (by types of lines I mean Note;MsgTrace(num/num) blah:, with the blah being the relevant data). Now the Error-Handler and bounced lines are talking about the same thing, but I can't determine what should be in a window prior to it. I did manage to figure out that if data element 1 exists, and there are Error-Handler entries within the data, then those entries are superfluous; I just couldn't think of a more elegant way to ignore them.

I guess I could set some flag var, test for it, and ignore Error lines when it's set. But that again raises my question of speed. Which is faster: testing for a second element and whacking the Error lines from the data set, or testing for the element, setting a flag, and then testing that within a tighter loop? Which is better/faster?
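For what it's worth, a minimal sketch of the flag version as I picture it (assuming @in holds the raw lines for one data set, as above):

    my @data = grep { !/received from internet:/ } @in;

    # Decide once, per data set, whether the Error-Handler lines are superfluous
    my $skip_errors = $data[1] && grep { /Error-Handler/ } @data;

    for my $line (@data) {
        next if $skip_errors && $line =~ /Error-Handler/;
        # ... process $line as before ...
    }

This trades building a second filtered array for one extra regex test per line inside the loop. On 2-100 element arrays either could win; the Benchmark module's cmpthese would settle it in a minute.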
Thanks for the feedback. I'll think about the separate structures on my ride home. :)
/* And the Creator, against his better judgement, wrote man.c */