Re: The need for speed
by BrowserUk (Patriarch) on Jan 24, 2003 at 00:30 UTC
|
Just looking at your sub process_it:
- my($rec_line,$from,$tmp,$count,$line,$type,$tmp_c,$total_c,$unknown, @in,@data,@line);
It's nice to see you using my, but it would be so much more effective if you used it at the point/scope at which the variable is used. It would save you having to do stuff like undef($tmp_c).
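For example (a throwaway sketch; @lines and the per-line work are just stand-ins):
for my $line (@lines) {
    my $tmp_c = $line =~ tr/@/@/;   # fresh lexical every pass; nothing to undef
    print "$tmp_c recipients\n";
}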
$unknown is an appropriate name :). It appears nowhere else in the entire file.
- @in = @_;
You make a local copy of @_ here, then a few lines further on, you reduce @in using grep into @data and never use @in again.
Better to reduce memory and cycles by supplying @_ directly to the grep.
- $count = $tmp =~ s/@/@/g;
A quick test shows that $count = $tmp =~ tr/@/@/; is about 4 times faster at counting occurrences of a single char.
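If you want to verify that on your own data, something like this will do (the sample string is made up):
use Benchmark 'cmpthese';

my $tmp = 'to=joe@example.com,sue@example.com,bob@example.com';
cmpthese( -3, {
    subst => sub { my $s = $tmp; my $c = ( $s =~ s/@/@/g ) },
    trans => sub { my $s = $tmp; my $c = ( $s =~ tr/@/@/ ) },
});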
- (@data) = grep(!/Error-Handler/, @data) if ( grep(/Error-Handler/, @data) && $data[1] );
This is .. er,.. strange. If the idea is that ...&& $data[1] ) will mean the grepping will only happen if the array has more than one line, too late! You already grep'd the array in the first part of the conditional. In fact, you will only get to check whether there is more than one line in the array if the grep has already discovered a line containing /Error-Handler/, which presumably means there was more than one line?
Further, if the reason for the conditional is to prevent wasted cycles by grepping to remove these lines unless they exist, again too late: you already grep'd to check. Ultimately, if there are no /Error-Handler/ lines, executing the code to remove them will have no effect, but at least you will only have grep'd once, not twice.
-
foreach $line (@data) {
@line = split(/:/, $line);
You take a line from an array called @data and call it $line, then split $line into an array called @line, which is referred to as $line[n] in the rest of the loop. Nothing wrong exactly, but mighty confusing.
Reminds me of the MP sketch about the Bruces ;).
- warn "NO type: $line[0]\n" if ($type =~ /^(\s+|)$/);
Why /^(\s+|)$/ instead of /^\s*$/? I vaguely remember something about (?:\s+|) as an optimisation, but the fact that you were forcing the spaces-or-nothing to be captured and then discarded must surely be costing you more than you were saving?
That said, the value you are checking, $type, comes from the previous line, $type = $states{$2};.
Now if $2 is not a key of %states, then that will return undef, so that line would be much better written as
warn "NO type: $line[0]\n" if not defined $type;
or even
warn "..." unless $type;<code>
<p>which are both more efficient and clearer.
<p>However, you then go on to use $type as a hash key in
<p><code> $by_type{$from}{$type} += $tmp_c;
Which, if you had -w or use warnings; in your code, would result in a runtime warning of something like
Use of uninitialized value in hash element at line nnn, and would remove the need for issuing your own. It is also possibly a more accurate reason for the note in your code:
print <<EOF;
Note:
Due to the methods employed to log the mta data, the breakdown by type of mail will not
necessarily add up to the total messages sent from an IP. The total is accurate and the
breakdown may be short data.
EOF
I believe that the sub could be reduced to something like the following at least. This is completely untested, and changed by (quick) inspection only, but I think it should be close and run quite a bit quicker. As it is called many times at the heart of your application, it might make a substantial difference to your run times. I'd like to hear the results.
I think you could also consolidate at least 1 if not both of the temporary count vars $tmp_c & $total_c, but I couldn't wrap my head around the logic.
sub process_it {
    my ($rec_line) = grep { /received from internet:/ } @_;
    my ($from, $tmp) = ( split(/:/, $rec_line) )[3, -1];
    $from =~ s/(fromhost=|$CHARS)//g;
    my $count = $tmp =~ tr/@/@/;   # recipient count from the received line
    # block form: grep(EXPR and EXPR, LIST) won't parse as intended
    my @data = grep { !/received from internet:/ && !/Error-Handler/ } @_;
    my $total_c = 0;
    foreach my $line (@data) {
        my @line = split(/:/, $line);
        my ($tmp_c, $type) = (0);
        if ($line[0] =~ /( |-)([a-zA-Z]+)$/) {
            $type = $states{$2};
            warn "NO type: $line[0]\n" unless defined $type;
            $tmp_c += ($type =~ /del_rem/) ? $line[-1] =~ tr/@/@/ : 1;
        }
        $by_type{$from}{$type} += $tmp_c;
        $total_c += $tmp_c;
    }
    $by_ips{$from} += $total_c;
}
A quick scan of the rest of the code suggests that some of these changes could be beneficial there too, but that's left as AEFTP.
Hope it helps some.
Examine what is said, not who speaks.
The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.
Re: The need for speed
by sauoq (Abbot) on Jan 23, 2003 at 23:33 UTC
|
There are a couple things you can do to speed up your code a bit. I don't know if they'll give you the kind of improvements you are looking for, though. You didn't mention what hardware you got those runtimes on, by the way. For heavy log processing, it often makes sense to use a dedicated host or hosts, and this might be a problem worth throwing some hardware at as you need it in the future. Analyzing 6 hours' worth of data in 4 hours still allows you to finish well ahead of starting the next batch, so why isn't it quick enough? What are your requirements?
Following are some optimizations.
The first thing I see is that you use s/// to count characters. You should use tr/// for that instead.
my $count = $string =~ tr/@//;
I don't know how much it will help in your case but using alternation when a character class will do is generally frowned upon. Also, you might as well use precompiled expressions when you can. I'd change
$CHARS = '(<|>|\[|\])';
to
$CHARS = qr/([][<>])/;
In your first loop, why do you perform a substitution on $id after matching it with a regular expression in the first place? I would replace
$id = $1 if (/msgid=([^:]+):/);
$id =~ s/(^<|>$)//g;
with
($id) = /msgid=<([^>]+)/;
It isn't exactly the same but I think it would work based on your sample data. If the msgid isn't always bracketed by '<' and '>' then it will break. It will also break if a '>' is permitted in a msgid. One way or the other though, there is bound to be a way to express what you need in one regex.
This may be more of a style issue, but in that first loop, you also have
if (!$id || $id =~ /^(\s+|)$/) {
$no_id++;
next;
}
Why not just say
$no_id++ and next unless $id =~ /\S/;? For that matter, you start it with next if (!/MsgTrace/); which would be better expressed as next unless /MsgTrace/; because it avoids the double negative.
In your next loop, you make a copy of your data each time. That's a waste, especially since you go to lengths not to modify your copy and you throw both the copy and the original out when you are done. (In fact, you probably shouldn't go to the trouble of throwing the original out, by the way. You aren't doing what you think anyway.) Just work with the original. That's bound to save you some time. Also, consider passing a reference to process_it() rather than the array itself (see the sketch below). Also, in that function, each of those greps is a loop. Try to combine them. Instead of
(@data) = grep(!/received from internet:/, @in);
(@data) = grep(!/Error-Handler/, @data)
if ( grep(/Error-Handler/, @data) && $data[1] );
try something like (untested):
my (@keep, @eh);
for (@data) {
    next if /received from internet:/;          # always dropped
    if (/Error-Handler/) { push @eh, $_ }
    else                 { push @keep, $_ }
}
# drop the Error-Handler lines only when something else survives
# (this reproduces your "&& $data[1]" test -- do you really mean that, by the way?)
@data = (@eh && @keep + @eh > 1) ? @keep : (@keep, @eh);
There's probably a better way to do that, but that's one way you might reduce those three loops (greps) to a single pass.
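And to illustrate the pass-a-reference point from above (the records are stand-ins):
my @data = ( 'a:b:c', 'd:e:f' );   # stand-in records
process_it(\@data);                # copies one scalar, not every line

sub process_it {
    my $data = shift;              # array reference
    for my $line (@$data) {
        my @fields = split /:/, $line;   # same per-line work as before
    }
}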
Another micro-optimisation: Don't capture things you don't use in regexen:
if ($line[0] =~ /( |-)([a-zA-Z]+)$/) {
would be better written as
if ($line[0] =~ /[ -]([a-zA-Z]+)$/) {
instead (note that the $2 in the following line then becomes $1).
I imagine you can clean it up in other ways too. The most dramatic changes to performance usually come from algorithmic changes and I don't know enough about your problem to really think about a better way to do it. Eliminating loops within other loops is one rule of thumb that should help you a great deal as long as you remember that greps and maps are loops too.
-sauoq
"My two cents aren't worth a dime.";
|
|
($id) = /msgid=<([^>]+)/;
In this particular case I think the semantics are duplicated exactly by
($id) = /msgid=<?(.+?)>?:/;
Though it is likely he really wants
($id) = /msgid=<(.+?)>:/;
Makeshifts last the longest.
|
|
Thanks to all about the tr/// as opposed to s///. I hadn't come up with a decent way to deal with how many recipients a message had, and I remembered a post here that used =~ s/// to count, I think it was dots. So I used it because it fit.
On the $id thing: yes, the appropriate regex is
$id = $1 if (/msgid=<([^>]+)>/);
as the chars <> will never be valid chars in the ID.
The reason for the multiple greps: I have a data set; let's say in instance 1 it's 3 lines, in instance 2 it's 2 lines. In instance 1 we have
1 received mail line
1 Error-Handler line
1 bounced line
In instance 2 we have
1 received line
1 Error-Handler line
What I am attempting to deal with is those disparate lines. So I take the received line from the data set. Then if there is more than 1 item left in the data set, and Error-Handler lines are also in the data set, I remove them. If there aren't, then I leave it alone because I need to count that the message came through; it was just processed outside the scope of the log file itself.
On the whole conditional against $id, it really is just style. If I am only doing 1 thing based on truth, I inline it, if not I use the braces.
Thanks for the pointer about not capturing if I'm not using it. Makes sense.
And as for your map usage, I am still a map newbie, though I will definitely see what I can get out of it in terms of mileage. Thanks for the input. Oh yeah, in terms of tossing the data away: if I don't, I run out of memory. I need to only process one file, extract the relevant data, then I need to clean my %data out or I simply don't have any memory left. As you can see from the data samples, the lines are long. It's amazing how your paradigm shifts along with the size of your data set :P
/* And the Creator, against his better judgement, wrote man.c */
|
|
The reason for the multiple greps.
The reason is irrelevant. The point is only that you are looping through data multiple times. If you want it to be fast, do your work on one pass through the data if it is at all possible. In your case, it is possible. You could even roll at least some of the work into your following foreach loop. In your process_it() sub you loop through the data a minimum of four(!!!) times. Five sometimes. Lots of loops and speed don't mix.
On the whole conditional against $id, it really is just style. If I am only doing 1 thing based on truth, I inline it, if not I use the braces.
Well, some of it is a matter of style. I threw you off by using unless in that manner. The code
if (!$id || $id =~ /^(\s+|)$/)
isn't particularly good for a couple of reasons. First, the !$id is not expressing what you are trying to say. Granted, you probably aren't going to have a message ID which evaluates to '0', but if you did... oops. Also, it probably isn't optimizing anything: if you have spaces, you have to check both clauses. Testing $id =~ /\S/ is much clearer and might well be a little more efficient in the long run too.
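To see the '0' pitfall (contrived, but legal):
my $id = '0';                      # a legal, if unlikely, message ID
print "skipped!\n" if !$id;        # fires: the string '0' is false in Perl
print "kept\n" if $id =~ /\S/;     # correct: '0' is non-blank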
Oh yeah, in terms of tossing the data away: if I don't, I run out of memory. I need to only process one file, extract the relevant data, then I need to clean my %data out or I simply don't have any memory left.
Declare %data with my just inside your foreach $file loop. Don't undef it one key at a time.
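Something like this (a sketch; the file list and the fill-in work are placeholders):
for my $file (@files) {
    my %data;                              # a fresh hash for every file
    open my $fh, '<', $file or die "$file: $!";
    while (<$fh>) {
        # ...fill %data from the line...
    }
    # ...report on %data...
}                                          # %data is freed here, all at once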
It's amazing how your paradigm shifts along with the size of your data set :P
Oh, I don't know... I've dealt with biggish datasets in the tens and hundreds of gigs. It's true there are some unique logistical considerations and some shortcuts you can't take but the basics of writing efficient code stay the same. The real "paradigm shift" comes when you have to break your problem down such that it can be distributed to multiple machines.
-sauoq
"My two cents aren't worth a dime.";
Re: The need for speed
by Elian (Parson) on Jan 23, 2003 at 22:20 UTC
Re: The need for speed
by talexb (Chancellor) on Jan 23, 2003 at 22:05 UTC
|
Let me deflect discussion on changing your code and ask the following instead:
- How do you know that your machine is running close to 100%?
- Could you get a database to take care of storing, sorting and selecting this data?
You might also be able to speed things up by going to a faster machine, one with more RAM or even by spreading the jobs over multiple machines.
--t. alex
Life is short: get busy!
|
|
My machine is not maxed out. It is a dual proc system, with its filesystem on a hardware RAID 0, and plenty of RAM. Theoretically I could use a DB, but that raises the question: having read the data and sorted it into relevant chunks, do I A) send the data to a DB of some sort, to later reopen it, reread it into memory, and then duly process it, or B) simply process it now? Also, if I were to segregate the computation of the data across hosts, I'm still dealing with finding a sane way of splitting the data, sending it out to some other host, polling to see when they are done (or waiting for them to finish processing), and then pulling all the data back together and correlating it. All of which adds to the run time.
I have also realized that not all the events for a msgid will occur within a given logfile, due to rotation, but I'm not even gonna go into maintaining state across files for a very small percentage of the actual data. As you can see in the code, I've been allowed to kinda punt in regards to absolute accuracy. Even though it may drive my inner anal-retentive geek crazy, such is life.
I guess I should step back and say I'm not attacking your points so much as stating I've already considered them and didn't think the benefits outweighed the costs in terms of code logic and run time. No offense intended in any way, shape, or form. I should also add that the system doing the processing is not one of the MTAs. It is a completely separate host, which is doing just about absolutely nothing aside from sshd. It's also a FreeBSD box, if it makes any difference.
/* And the Creator, against his better judgement, wrote man.c */
|
|
Having done processing of flat files in the way you have to, I can tell you that it's far, far easier, and often faster, to use a DB. Throughput is normally better, and having the data queryable in an easy way gives you the opportunity to analyze it in ways you may not have thought of, or thought of but discarded as infeasible because of processing concerns.
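Just to give the flavour (a sketch only, assuming DBI with some backend such as DBD::SQLite; the table and column names are invented):
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=mta.db', '', '', { RaiseError => 1 });
$dbh->do('CREATE TABLE msgs (ip TEXT, type TEXT, rcpts INTEGER)');   # first run only

my $ins = $dbh->prepare('INSERT INTO msgs VALUES (?, ?, ?)');
$ins->execute('10.0.0.1', 'delivered', 3);   # one row per parsed log event

# ad hoc questions become one-liners:
my $rows = $dbh->selectall_arrayref(
    'SELECT ip, SUM(rcpts) FROM msgs GROUP BY ip HAVING SUM(rcpts) > 499');
print "@$_\n" for @$rows;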
|
|
OK, well I'd suggest you try running all of the processes in parallel -- to get an idea of how much 'spare time' you have. You know, process data in one process while some of the others are reading through the data files.
Then maybe try running just three at a time, and so forth.
--t. alex
Life is short: get busy!
Re: The need for speed
by grinder (Bishop) on Jan 23, 2003 at 23:50 UTC
|
First of all, kudos to you for providing some data, and especially for having taken the time to sanitise it. ++ for that alone.
So, you're sorting 1 000 000 records in 30 seconds? That isn't too shabby, you know? Especially given their length. Depending on how much RAM your machine has, you may be paging out to disk. As I mentioned in a response to a similar question at Fast/Efficient Sort for Large Files, for that kind of volume you want to evaluate replacing Perl's sort by a call to the external sort command.
Sorting textual IP addresses is non-trivial: you'll have to write a preprocessor that takes the log, isolates the IP address, packs it with inet_aton, prepends it to the line, and writes it out. You can then sort the file with no special command line switches. Consider it a meta-Guttman-Rosler Transform if you will.
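The preprocessor might look roughly like this (the IP-matching pattern is a guess at the log format):
use Socket 'inet_aton';

# prepend a fixed-width sortable key so plain sort(1) orders by IP
while (my $line = <STDIN>) {
    my ($ip) = $line =~ /(\d{1,3}(?:\.\d{1,3}){3})/ or next;
    printf "%08X\t%s", unpack('N', inet_aton($ip)), $line;
}
The whole pipeline is then just prekey.pl < mta.log | sort | cut -f2- > sorted.log, and the report script reads sorted.log.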
You've switched off file buffering with $| = 1. You ought to remove that line and see if it makes a difference.
Other than that, there's nothing really glaring that you seem to be doing wrong. I'm not too sure about your choice of variable names (e.g. $tmp_c); it's hard to guess their purpose. I have a hard time fathoming the purpose of all those grep chains.
I'm not sure if the records lend themselves to the following approach, but if a record can only be counted in one single way, you could write a splitter filter that takes the one input file, opens up as many output files as there are categories, and writes the record to the correct category file. Then you write a series of filters to deal with the separate files.
There are a number of advantages to this approach. All the consistency checking goes in the splitter. The downstream filters need less error checking in them as they're dealing with a restricted range of records. If you find a bug in one category, you only have to fix its reporting script and rerun it, rather than running the whole batch. And smaller files put less of a strain on the system; you might pick up a few seconds here and there.
Categories might be number of recipients / inbound / outbound / garbage. Note that depending on which dimensions you choose, a record could be written to more than one category file. E.g. a message sent to both internal and external recipients.
When you split out to different files, you should strip out all extraneous data you don't want to play with. This means that in the reporting scripts you won't have as large a dataset flowing across your bus. Less I/O will improve your score.
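A skeleton of the splitter (the category names and match rules are invented):
my %out;
for my $cat (qw(inbound outbound garbage)) {
    open $out{$cat}, '>', "split.$cat" or die "split.$cat: $!";
}
while (my $line = <STDIN>) {
    # a record may go to more than one category file
    print { $out{inbound} }  $line if $line =~ /received from internet:/;
    print { $out{outbound} } $line if $line =~ /sent to internet/;   # invented rule
    print { $out{garbage} }  $line unless $line =~ /MsgTrace/;
}
close $_ for values %out;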
print@_{sort keys %_},$/if%_=split//,'= & *a?b:e\f/h^h!j+n,o@o;r$s-t%t#u'
Re: The need for speed
by traveler (Parson) on Jan 23, 2003 at 22:07 UTC
|
Absent a lot of real data it is hard to test. I do, however, have a few suggestions:
- Try to profile the code, e.g. using Devel::DProf, to see where the time is used (see the example below this list).
- Consider the loop:
foreach $ip ( sort { $by_ips{$a} <=> $by_ips{$b} } keys %by_ips ) {
...
}
You might save time dividing it up into the three lists of <499, <=599, >599 first if the list is large and the sort slow.
-
(@data) = grep(!/received from internet:/, @in);
(@data) = grep(!/Error-Handler/, @data)
    if ( grep(/Error-Handler/, @data) && $data[1] );
The second assignment appears to override the first...
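For the profiling suggestion above, the incantation is just (the script name is a stand-in):
perl -d:DProf mta_report.pl logfile    # writes tmon.out as it runs
dprofpp tmon.out                       # per-subroutine time report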
HTH, --traveler
|
|
1. Try to profile the code, e.g. using Devel::DProf to see where the time is used.
This is sound advice. I actually made use of extensive prints and such, correlating how much time was spent where. Actually reading the data in and then sorting it into data sets like I showed above requires about 30 seconds per million lines.
From there, each array (anywhere from 2 elements to as many as 100 elements) is processed. No single array that I've processed has taken longer than 1 second, even the larger ones; it's just the sheer number of arrays to be dealt with. So out of, say, a runtime of 275 seconds, 240 seconds is spent processing the actual arrays and printing.
2. Consider the loop:
foreach $ip ( sort { $by_ips{$a} <=> $by_ips{$b} } keys %by_ips ) {
...
}
You might save time dividing it up into the three lists of <499, <=599, >599 first if the list is large and the sort slow.
You're right, I guess I could, and then only deal with a certain space. But the issue isn't just one log. That loop is the end of the run. It is the culmination of 4 logs per server times 11 servers. So if, let's say, joe at some IP logs 5 messages on mta1, 20 messages on mta2, 450 messages on mta6 and 1200 messages on mta9 (say, due to load balancing), then joe's total is 1675 messages. But during processing each of those values is in a separate scalar, and they get merged at the end of the subroutine into the global hashes. So when does the segregation happen? Would it be better to segregate data per host into its own hash? But then I need to merge all that data in order to determine if it falls under the <= 499 case, in which case it doesn't get printed at all. If it's 500 - 999 then it only shows IP and # of messages; >= 1000 and we now need to break down what happened to all that mail. I guess I could test along the way to see if a value has exceeded X and then move it to a new data structure, and if it's exceeded Y then move it to yet another structure. Then loop over struct X to simply print, and struct Y to print and break down. That is a thought. I don't know if it's faster to loop over 2 smaller structures than 1 larger one, but it's a thought for sure.
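Sketched out, that last idea might look like this (thresholds from the description above; hash names from the posted code):
my (%mid, %big);
while ( my ($ip, $n) = each %by_ips ) {
    next if $n <= 499;                   # never printed
    if ($n <= 999) { $mid{$ip} = $n } else { $big{$ip} = $n }
}
for my $ip ( sort { $mid{$a} <=> $mid{$b} } keys %mid ) {
    print "$ip $mid{$ip}\n";             # IP and count only
}
for my $ip ( sort { $big{$a} <=> $big{$b} } keys %big ) {
    print "$ip $big{$ip}\n";             # ...plus the per-type breakdown
}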
(@data) = grep(!/received from internet:/, @in);
(@data) = grep(!/Error-Handler/, @data)
    if ( grep(/Error-Handler/, @data) && $data[1] );
The second assignment appears to override the first...
It does. I run into situations where I will have a server msg stating an SMTP connection was opened and a message was received. Due to log rotations, and other misc "features" of the MTA code itself, I may get something like data set 3, which has a received and an Error-Handler line within it. Or I may get something like data set 5, which has a received, an Error-Handler, and a bounced line. (By types of lines I mean Note;MsgTrace(num/num) blah:, with the blah being the relevant data.) Now the Error-Handler and bounced lines are talking about the same thing, but I can't determine what should be in a window prior to it. I managed to figure out, though, that if data elem 1 exists, and there are Error-Handler entries within the data, then they are superfluous; I just couldn't think of a more elegant means to ignore them. I guess I could set some flag var, test for it, and if it exists ignore Error lines. But that again raises my question of speed: which is faster, testing for a second elem and whacking Error lines from the data set, or testing for the elem, setting a flag, and then testing that within a tighter loop? Which is better/faster?
Thanks for the feedback. I'll think about the separate structures on my ride home. :)
/* And the Creator, against his better judgement, wrote man.c */
Re: The need for speed
by l2kashe (Deacon) on Jan 24, 2003 at 16:47 UTC
|
Just wanted to take a second to say thanks. I mean, the reason the Seekers section is here is to help and be helped, but I always find PerlMonks to be a breath of fresh air among online communities.
With the help above I managed to get my sort of data into relevant sets down from around 35-40 seconds to around 18-22 seconds, which isn't too shabby for about 1 million lines per file.
Also I managed to go from >= 240 seconds total processing per file to <= 200 seconds per file. Which isn't huge, but it adds up, and some files were taking over 400 seconds to process, so there it really helped.
Just wanted to say thanks :)
If anyone expresses interest I can always slap the new code up on my scratchpad; just msg me.
/* And the Creator, against his better judgement, wrote man.c */
Re: The need for speed
by rir (Vicar) on Jan 24, 2003 at 01:29 UTC
|
You might also shave a little by replacing the %states hash with an array indexed via constants:
use constant bounced   => 0;
use constant deferred  => 1;
use constant directory => 2;   # et cetera
my @states;                    # indexed by the constants above
Regexes like
next if (!/MsgTrace/);
should be faster if anchored. I'm just guessing at the values here:
next if ( !/^.{52,55}MsgTrace/);
|
|
Cheaper still is to skip the regex engine entirely:
next if index($_, 'MsgTrace') == -1;
I like your anchoring idea though. I'll have to remember that one.