in reply to Re: The need for speed
in thread The need for speed

Thanks to all for the pointer about tr/// as opposed to s///. I hadn't come up with a decent way to count how many recipients a message had, and I remembered a post here that used =~ s/// to count (I think it was) dots, so I used it because it fit.
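
A minimal sketch of counting with tr///, assuming the recipients sit in one comma-separated string. The variable $rcpt_field and its contents are made up for illustration; the real log format is not shown in this post.

```perl
use strict;
use warnings;

# Hypothetical recipient field; not taken from the real logs.
my $rcpt_field = 'a@example.com,b@example.com,c@example.com';

# With an empty replacement list and no /d modifier, tr/// changes
# nothing and returns the number of matching characters.
my $commas = ($rcpt_field =~ tr/,//);

# Assumes a non-empty list: N commas separate N + 1 recipients.
my $recipients = $commas + 1;

print "$recipients recipients\n";
```

Unlike s///g, tr/// does no regex matching or substitution work here, which is why it tends to be the cheaper way to count single characters.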

On the $id thing: yes, the appropriate regex is
$id = $1 if (/msgid=<([^>]+)>/);
as the chars <> will never be valid chars in the ID.
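
A small sketch of that match in isolation. The log line is hypothetical; only the msgid=<...> shape comes from this post.

```perl
use strict;
use warnings;

# Hypothetical log line; the surrounding fields are invented.
my $line = 'received from=<x@example.com> msgid=<20030124.abc@example.com> size=1024';

my $id;
# [^>]+ stops at the first '>', and the parentheses put the ID in $1.
$id = $1 if $line =~ /msgid=<([^>]+)>/;

print defined $id ? "id: $id\n" : "no msgid found\n";
```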

The reason for the multiple greps: I have a data set. Let's say in instance 1 it's 3 lines and in instance 2 it's 2 lines. In instance 1 we have
1 received mail line
1 Error-Handler line
1 bounced line

In instance 2 we have
1 received line
1 Error-Handler line

What I am attempting to deal with is those disparate lines. So I take the received line from the data set. Then, if there is more than 1 item left in the data set and Error-Handler lines are among them, I remove the Error-Handler lines. Otherwise I leave the data set alone, because I still need to count that the message came through; it was just processed outside the scope of the log file itself.
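
The rule above can be sketched roughly as follows. This is an illustration only: the sub name reduce_set and the line prefixes are hypothetical stand-ins, not the real log lines or the real process_it() code.

```perl
use strict;
use warnings;

# Split one message's lines into the received line(s) and whatever is left,
# then apply the Error-Handler rule to the leftovers.
sub reduce_set {
    my @lines = @_;

    my @received = grep {  /^received/ } @lines;
    my @rest     = grep { !/^received/ } @lines;

    # Drop Error-Handler lines only when more than one line is left.
    if (@rest > 1 && grep { /^Error-Handler/ } @rest) {
        @rest = grep { !/^Error-Handler/ } @rest;
    }
    return (\@received, \@rest);
}

# Instance 1: three lines, so the Error-Handler line is dropped.
my ($recv1, $rest1) = reduce_set('received: msg 1', 'Error-Handler: msg 1', 'bounced: msg 1');

# Instance 2: two lines, so the lone Error-Handler line is kept.
my ($recv2, $rest2) = reduce_set('received: msg 2', 'Error-Handler: msg 2');

print "instance 1 keeps: @$rest1\n";
print "instance 2 keeps: @$rest2\n";
```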

On the whole conditional against $id, it really is just style. If I am only doing 1 thing based on truth, I inline it, if not I use the braces.

Thanks for the pointer about not capturing if I'm not using it. Makes sense.

And as for your map usage, I am still a map newbie, though I will definitely see what I can get out of it in terms of mileage. Thanks for the input. oh yeah.. in terms of tossing the data away. If I don't I run out of memory. I need to only process one file, extract the relevant data, then I need to clean my %data out or I simply don't have any memory left. As you can see from the data samples the lines are long. It's amazing how your paradigm shifts along with the size of your data set :P

/* And the Creator, against his better judgement, wrote man.c */

Re: Re: Re: The need for speed
by sauoq (Abbot) on Jan 24, 2003 at 07:57 UTC
    The reason for the multiple greps.

    The reason is irrelevant. The point is only that you are looping through data multiple times. If you want it to be fast, do your work on one pass through the data if it is at all possible. In your case, it is possible. You could even roll at least some of the work into your following foreach loop. In your process_it() sub you loop through the data a minimum of four(!!!) times. Five sometimes. Lots of loops and speed don't mix.
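
    A rough sketch of what one pass could look like, assuming each line's type can be recognised from its start. The sub name split_set and the line prefixes are invented for illustration; this is not the actual process_it() code.

```perl
use strict;
use warnings;

# Classify every line of one message's data set in a single loop.
sub split_set {
    my (@received, @handler, @other);
    for my $line (@_) {
        if    ($line =~ /^received/)      { push @received, $line }
        elsif ($line =~ /^Error-Handler/) { push @handler,  $line }
        else                              { push @other,    $line }
    }

    # Rule from the parent post: drop the Error-Handler lines only when
    # more than one non-received line is left.
    my @rest = (@handler && @handler + @other > 1) ? @other : (@handler, @other);
    return (\@received, \@rest);
}

my ($recv1, $rest1) = split_set('received: msg 1', 'Error-Handler: msg 1', 'bounced: msg 1');
my ($recv2, $rest2) = split_set('received: msg 2', 'Error-Handler: msg 2');

print "instance 1 keeps: @$rest1\n";
print "instance 2 keeps: @$rest2\n";
```

    Each line is inspected once, and the decision about the Error-Handler lines only needs the bucket sizes, not another trip through the data.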

    On the whole conditional against $id, it really is just style. If I am only doing 1 thing based on truth, I inline it, if not I use the braces.

    Well, some of it is a matter of style. I threw you off by using unless in that manner. The code

    if (!$id || $id =~ /^(\s+|)$/)
    isn't particularly good for a couple reasons. First, the !$id is not expressing what you are trying to say. Granted, you probably aren't going to have a message ID which evaluates to '0' but if you did... oops. Also, it probably isn't optimizing anything. If you have spaces, you have to check both clauses. Using if (/\S/) is much clearer and might well be a little more efficient in the long run too.
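
    To make the '0' case concrete, here is a small comparison. The sub names are made up for the example, and the defined() guard is an addition so that an undefined $id does not trigger a warning.

```perl
use strict;
use warnings;

# The original check: returns 1 when $id is treated as missing.
sub old_is_blank {
    my ($id) = @_;
    return (!$id || $id =~ /^(\s+|)$/) ? 1 : 0;
}

# The /\S/ alternative: returns 1 when $id has a non-whitespace character.
sub has_content {
    my ($id) = @_;
    return (defined $id && $id =~ /\S/) ? 1 : 0;
}

for my $id ('abc@example.com', '0', '', '   ') {
    printf "%-17s old says blank: %d, /\\S/ says present: %d\n",
        "'$id'", old_is_blank($id), has_content($id);
}
```

    The two checks disagree only on '0': the old test calls it blank because '0' is false in Perl, while /\S/ sees a real character.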

    oh yeah.. in terms of tossing the data away. If I don't I run out of memory. I need to only process one file, extract the relevant data, then I need to clean my %data out or I simply don't have any memory left.

    Declare %data with my just inside your foreach $file loop. Don't undef it one key at a time.
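
    Something along these lines, as a sketch. The file names and contents are hypothetical in-memory stand-ins so the example runs by itself, and %summary is an invented name for whatever small results need to survive the loop.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for log files: name => contents.
my %files = (
    'mail.log.1' => "id1 received\nid1 bounced\nid2 received\n",
    'mail.log.2' => "id3 received\n",
);

my %summary;    # small per-file results that outlive the loop
for my $file (sort keys %files) {
    my %data;   # new, empty hash on every iteration

    # Reading from a scalar reference stands in for opening a real file.
    open my $fh, '<', \$files{$file} or die "Cannot open $file: $!";
    while (my $line = <$fh>) {
        my ($id) = split ' ', $line;
        push @{ $data{$id} }, $line;
    }
    close $fh;

    $summary{$file} = keys %data;   # keep only what is needed later
}   # %data goes out of scope here, so Perl can reuse its memory

print "$_: $summary{$_} messages\n" for sort keys %summary;
```

    Because %data is lexical to the loop body, there is nothing to clean up by hand between files.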

    It's amazing how your paradigm shifts along with the size of your data set :P

    Oh, I don't know... I've dealt with biggish datasets in the tens and hundreds of gigs. It's true there are some unique logistical considerations and some shortcuts you can't take but the basics of writing efficient code stay the same. The real "paradigm shift" comes when you have to break your problem down such that it can be distributed to multiple machines.

    -sauoq
    "My two cents aren't worth a dime.";