in reply to The need for speed

There are a couple things you can do to speed up your code a bit. I don't know if they'll give you the kind of improvements you are looking for though. You didn't mention what hardware you got those runtimes on, by the way. For heavy log processing, it often makes sense to use a dedicated host or hosts and this might be a problem worth throwing some hardware at as you need it in the future. Analyzing 6 hours worth of data in 4 hours still allows you to finish well ahead of starting the next batch so why isn't it quick enough? What are your requirements?

Following are some optimizations.

The first thing I see is that you use s/// to count characters. You should use tr/// for that instead.

my $count = $string =~ tr/@//;

I don't know how much it will help in your case but using alternation when a character class will do is generally frowned upon. Also, you might as well use precompiled expressions when you can. I'd change

$CHARS = '(<|>|\[|\])';
to
$CHARS = qr/([][<>])/;

In your first loop, why do you perform a substitution on $id after matching it with a regular expression in the first place? I would replace

$id = $1 if (/msgid=([^:]+):/); $id =~ s/(^<|>$)//g;
with
($id) = /msgid=<([^>]+)/;
It isn't exactly the same but I think it would work based on your sample data. If the msgid isn't always bracketed by '<' and '>' then it will break. It will also break if a '>' is permitted in a msgid. One way or the other though, there is bound to be a way to express what you need in one regex.

This may be more of a style issue, but in that first loop, you also have

if (!$id || $id =~ /^(\s+|)$/) { $no_id++; next; }
Why not just say $no_id++ and next unless $id =~ /\S/;? For that matter, you start it with next if (!/MsgTrace/); which would be better expressed with next unless /MsgTrace/; because it avoid the double negative.

In your next loop, you make a copy of your data each time. That's a waste especially since you go to lengths not to modify your copy and you throw both the copy and the original out when you are done. (In fact, you probably shouldn't go to the trouble throwing the original out, by the way. You aren't doing what you think anyway.) Just work with the original. That's bound to save you some time. Also, consider passing a reference to process_it() rather than the array itself. Also, in that function, each of those greps is a loop. Try to combine them. Instead of

(@data) = grep(!/received from internet:/, @in); (@data) = grep(!/Error-Handler/, @data) if ( grep(/Error-Handler/, @data) && $data[1] );
try something like (untested):
if ($data[1]) { # Do you really mean that, by the way? my (@t1,@t2); @t1 = map { if (/Error-Handler/) { $_ } else { !/received from internet:/ and push @t2, $_; (); } } @data; @data = @t2 if @t1; }
There's probably a better way to do that but that's one way you might reduce those three loops (greps) to one (map).

Another micro-optimisation: Don't capture things you don't use in regexen:

if ($line[0] =~ /( |-)([a-zA-Z]+)$/) {
would be better written as
if ($line[0] =~ /[ -](a-zA-Z]+)$/) {
instead.

I imagine you can clean it up in other ways too. The most dramatic changes to performance usually come from algorithmic changes and I don't know enough about your problem to really think about a better way to do it. Eliminating loops within other loops is one rule of thumb that should help you a great deal as long as you remember that greps and maps are loops too.

-sauoq
"My two cents aren't worth a dime.";

Replies are listed 'Best First'.
Re^2: The need for speed
by Aristotle (Chancellor) on Jan 23, 2003 at 23:50 UTC
    ($id) = /msgid=<([^>]+)/;
    In this particular case I think the semantics are duplicated exactly by
    ($id) = /msgid=<?(.+?)>?:/;
    Though it is likely he really wants
    ($id) = /msgid=<(.+?)>:/;

    Makeshifts last the longest.

Re: Re: The need for speed
by l2kashe (Deacon) on Jan 24, 2003 at 05:56 UTC
    Thanks to all about the tr/// as opposed to s///, I hadn't come up with decent way to deal with how many recipients a message had, and I remembered a post here that used =~ s/// to count I think it was dots. So I used it cause it fit

    On the $id thing. yes the appropriate regex is
    $id = $1 if (/msgid=<[^>]+>/);
    as the chars <> will never be valid chars in the ID.

    The reason for the multiple greps. I have a data set, lets say in instance 1 its 3 lines, in instance 2 is 2 lines. In instance 1 we have
    1 received mail line
    1 Error-Handler line
    1 bounced line

    In instance 2 we have
    1 received line
    1 Error-Handler line

    What I am attempting to deal with is those disparate lines. So I take the received line from the data set. Then if there is more than 1 item left in the data set, and Error-Handler lines are also in the data set I remove them. If there aren't then I leave it alone cause I need to count that the message came through, it was just processed outside the scope of the log file itself.

    On the whole conditional against $id, it really is just style. If I am only doing 1 thing based on truth, I inline it, if not I use the braces.

    Thanks for the pointer about not capturing if Im not using it. Makes sense

    And as for your map usage, I am still a map newbie. Though I will definately see what I can get out of it in terms of mileage. Thanks for the input. oh yeah.. in terms of tossing the data away. If I dont I run out of memory. I need to only process one file, extract the relevant data, then I need to clean my %data out or I simply dont have any memory left. As you can see from the data samples the lines are long. It amazing how your paradigm shifts along with the size of your data set :P

    /* And the Creator, against his better judgement, wrote man.c */
      The reason for the multiple greps.

      The reason is irrelevant. The point is only that you are looping through data multiple times. If you want it to be fast, do your work on one pass through the data if it is at all possible. In your case, it is possible. You could even roll at least some of the work into your following foreach loop. In your process_it() sub you loop through the data a minimum of four(!!!) times. Five sometimes. Lots of loops and speed don't mix.

      On the whole conditional against $id, it really is just style. If I am only doing 1 thing based on truth, I inline it, if not I use the braces.

      Well, some of it is a matter of style. I threw you off by using unless in that manner. The code

      if (!$id || $id =~ /^(\s+|)$/)
      isn't particularly good for a couple reasons. First, the !$id is not expressing what you are trying to say. Granted, you probably aren't going to have a message ID which evaluates to '0' but if you did... oops. Also, it probably isn't optimizing anything. If you have spaces, you have to check both clauses. Using if (/\S/) is much clearer and might well be a little more efficient in the long run too.

      oh yeah.. in terms of tossing the data away. If I dont I run out of memory. I need to only process one file, extract the relevant data, then I need to clean my %data out or I simply dont have any memory left.

      Declare %data with my just inside your foreach $file loop. Don't undef it one key at a time.

      It amazing how your paradigm shifts along with the size of your data set :P

      Oh, I don't know... I've dealt with biggish datasets in the tens and hundreds of gigs. It's true there are some unique logistical considerations and some shortcuts you can't take but the basics of writing efficient code stay the same. The real "paradigm shift" comes when you have to break your problem down such that it can be distributed to multiple machines.

      -sauoq
      "My two cents aren't worth a dime.";