in reply to Re: Re: The need for speed
in thread The need for speed

The reason for the multiple greps.

The reason is irrelevant. The point is only that you are looping through the data multiple times. If you want it to be fast, do your work in one pass through the data if at all possible. In your case, it is possible. You could even roll at least some of that work into your following foreach loop. In your process_it() sub you loop through the data a minimum of four(!!!) times. Five, sometimes. Lots of loops and speed don't mix.
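As a sketch of what I mean (the patterns here are made up, since I don't have your actual greps in front of me), several greps over the same array collapse into a single loop:

```perl
use strict;
use warnings;

# Hypothetical: suppose the code made three passes, e.g.
#   my @errors   = grep { /ERROR/ }      @lines;
#   my @warnings = grep { /WARNING/ }    @lines;
#   my @longs    = grep { length > 80 }  @lines;
# One pass over @lines does the same work:
my @lines = ("ERROR: disk full", "WARNING: low memory", "x" x 100, "ok");
my (@errors, @warnings, @longs);
for my $line (@lines) {
    push @errors,   $line if $line =~ /ERROR/;
    push @warnings, $line if $line =~ /WARNING/;
    push @longs,    $line if length($line) > 80;
}
```

Each element is examined once instead of three times, which matters more and more as the data grows.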

On the whole conditional against $id, it really is just style. If I'm only doing one thing based on truth, I inline it; if not, I use the braces.

Well, some of it is a matter of style. I threw you off by using unless in that manner. The code

if (!$id || $id =~ /^(\s+|)$/)
isn't particularly good for a couple of reasons. First, !$id doesn't express what you are trying to say. Granted, you probably aren't going to have a message ID which evaluates to '0', but if you did... oops. Second, it probably isn't optimizing anything: when $id is all whitespace, both clauses have to be checked anyway. A test like unless ($id =~ /\S/) is much clearer and may well be a little more efficient in the long run too.
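To make the difference concrete (the IDs here are invented for illustration), a /\S/ test says "does $id contain anything besides whitespace?" in one step, and it treats an ID of "0" correctly where !$id would not:

```perl
use strict;
use warnings;

# The original test:  if (!$id || $id =~ /^(\s+|)$/)  { skip... }
# wrongly skips an $id of "0", since "0" is false in Perl.
# Testing for any non-whitespace character says what is meant:
my @processed;
for my $id ("MSG123", "0", "   ", "") {
    next unless $id =~ /\S/;    # skip empty or all-whitespace IDs only
    push @processed, $id;
}
# "MSG123" and "0" are processed; "   " and "" are skipped.
```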

Oh yeah... in terms of tossing the data away: if I don't, I run out of memory. I need to process only one file, extract the relevant data, and then clean my %data out, or I simply don't have any memory left.

Declare %data with my just inside your foreach $file loop. Don't undef it one key at a time.
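Something like this (process_it() here is just a stand-in, since I don't have your sub): the hash goes out of scope at the end of each iteration and Perl reclaims it for you, no per-key deletes needed.

```perl
use strict;
use warnings;

# Stand-in for the real work; returns how many keys it saw.
sub process_it { my ($data) = @_; return scalar keys %$data }

my @files = ("a.log", "b.log");
for my $file (@files) {
    my %data;              # fresh, empty hash every iteration
    $data{$file} = 1;      # stand-in for "extract the relevant data"
    process_it(\%data);
}   # %data goes out of scope here and its memory is reclaimed
```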

It's amazing how your paradigm shifts along with the size of your data set :P

Oh, I don't know... I've dealt with biggish datasets in the tens and hundreds of gigs. It's true there are some unique logistical considerations and some shortcuts you can't take, but the basics of writing efficient code stay the same. The real "paradigm shift" comes when you have to break your problem down such that it can be distributed to multiple machines.

-sauoq
"My two cents aren't worth a dime.";