ImpalaSS has asked for the wisdom of the Perl Monks concerning the following question:

Hello Everyone, I am again in need of help to make this script a little faster. It currently takes 20 minutes to run, and I would like to speed it up. Here is some code:
foreach $key (@totalkeys) {
    $NETID = $totalnetname{$key};
    @ECLDATA  = grep /^$NETID\|$datecompare/, @ECL_STAT;
    @ECL_STAT = grep !/^$NETID\|/, @ECL_STAT;
    @CELDATA  = grep /^$NETID\|$datecompare/, @CEL_STAT;
    @CEL_STAT = grep !/^$NETID\|/, @CEL_STAT;
    # a lot more code................
}
Basically, the arrays ECL_STAT and CEL_STAT are searched 1,200 times, using a different NETID each time. The arrays are around 60,000 lines each, and each search string yields about 50 entries. What I would like to do is this: as the array is searched and the results are stored in the corresponding array (ECLDATA and CELDATA), I would like those entries to be removed from the main array. The way I am doing it here didn't seem to make the script run any faster; in fact it seemed to slow it down. I assume that's because the script now does two searches per array instead of just one. Is there a way, in one simple faster step, to delete the data from the main array as it is found?

Thanks Everyone

Also: this is not homework :) hehe, I know it might be interpreted that way, but it's not :)

Dipul

Replies are listed 'Best First'.
(tye)Re: Grep Effeciency
by tye (Sage) on Feb 28, 2001 at 00:49 UTC

    You are looping over @ECL_STAT 2*1200 times! Loop over it once, storing the data in a structure that is efficient to look up by $NETID:

    for ( @ECL_STAT ) {
        my( $NETID )= split /\|/, $_;
        push @{$byID{$NETID}}, $_;
    }

    Then you can do your calculations on that.

    Update: Change 4 to 2. Thanks to arturo for the sharp eyes.

            - tye (but my friends call me "Tye")
      Hey Tye, Exactly, that's the problem. The array is searched 1200 times. I like your idea of searching it just once and splitting it up. Here is a sample of the file (CEL_STAT is exactly the same, only the numbers after the netid are different):
      The netid is first: 21352pa|02/27/2001|03:30|25|363|32526.4363|363|3627524|


      I shortened the line; each line contains about 70 fields of data. So, basically, there is the time (03:30 in this case) and the netid. The file is searched for every NETID, and each search finds around 50 lines (48 to be exact): one set of data for every half hour. Then this data is split at the | and calculated on. The only way I can think of to make it faster is, as it is searched and the 50 or so matching lines are found, to just delete them, thus making the main array smaller so the next search will be a little faster. I like your idea from above, but I really have no idea what it's doing, and it confuses me.

      Thanks Again :)

      Dipul
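To make tye's one-pass idea above more concrete, here is a self-contained sketch using the record format from the sample line. The data values below are invented for illustration:

```perl
use strict;
use warnings;

# Invented sample records in the same format as the sample line:
# netid first, then pipe-delimited fields.
my @ECL_STAT = (
    '21352pa|02/27/2001|03:30|25|363|32526.4363|363|3627524|',
    '99887xy|02/27/2001|03:30|11|120|10000.0000|120|1200000|',
    '21352pa|02/27/2001|04:00|26|370|32599.1212|370|3700000|',
);

# One pass over the data: peel the netid off the front of each line
# and file the whole line under that netid in a hash of arrays.
# Afterwards, $byID{'21352pa'} is a reference to an array holding
# every line for that netid.
my %byID;
for my $line (@ECL_STAT) {
    my ($NETID) = split /\|/, $line;    # first field only
    push @{ $byID{$NETID} }, $line;
}

# Fetching all ~50 lines for one netid is now a single hash lookup
# instead of a grep over 60,000 lines.
my @ECLDATA = @{ $byID{'21352pa'} || [] };
print scalar(@ECLDATA), "\n";   # prints 2 for this sample data
```

The `|| []` guards against a NETID that has no data at all, so the dereference never blows up.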
Re (tilly) 1: Grep Effeciency
by tilly (Archbishop) on Feb 28, 2001 at 01:54 UTC
    If you are scanning an array repeatedly, that is usually a sign that you wanted a hash first. So take your arrays and turn them into hashes of arrays in one step.
    foreach (@ECL_STAT) {
        if (/^(\w+)/) {
            # The net id is $1...
            push @{$ECL_LOOKUP{$1}}, $_;
        }
    }
    # Time passes
    foreach my $key (@totalkeys) {
        $NETID = $totalnetname{$key};
        @ECL_DATA = @{$ECL_LOOKUP{$NETID}};
        # etc
    }
    So you see you scan @ECL_STAT just once to break it up into a hash of arrays, and then you find those arrays when you want them by a hash lookup.

    This should be substantially faster.

    BTW, a random note: I didn't change your names, but your variable names are very similar to one another. And your loop variable was not lexical, which suggests to me that you are not using strict. Try using strict and see if that doesn't catch a few bugs...

(ichimunki) re: Grep Effeciency
by ichimunki (Priest) on Feb 28, 2001 at 00:34 UTC
    You are iterating over the same list two times. Why? You could write this using "foreach" and simply use conditionals to test each regex (with a push for valid values). In fact, your second "grep" only gives you a subset of your first "grep", so make sure to pay attention for things like that as you recode. You can probably also skip some regexes by using "next" within the "for" loop, if there are mutually exclusive matches.

    Update: Oops. Dyslexia. Looping twice for each structure. Still once too often. (Note, I would also make those list names a little less like each other. {grin} )
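A sketch of the single-pass idea ichimunki describes, assuming the same pipe-delimited format as the original post (the sample values are invented): each line is examined once and routed with push, so the two greps per array collapse into one loop.

```perl
use strict;
use warnings;

my $NETID       = '21352pa';
my $datecompare = '02/27/2001';

my @ECL_STAT = (
    '21352pa|02/27/2001|03:30|25|363|',   # right netid, right date
    '21352pa|02/26/2001|03:30|25|360|',   # right netid, wrong date
    '99887xy|02/27/2001|03:30|11|120|',   # different netid
);

# One pass instead of two greps: collect matching lines, keep every
# line for other netids, and drop this netid's non-matching dates
# (mirroring what the original pair of greps did).
my (@ECLDATA, @rest);
for my $line (@ECL_STAT) {
    if ($line =~ /^\Q$NETID\E\|/) {
        push @ECLDATA, $line
            if $line =~ /^\Q$NETID\E\|\Q$datecompare\E/;
        next;    # this netid never goes back into @rest
    }
    push @rest, $line;
}
@ECL_STAT = @rest;   # the main array shrinks as netids are consumed
```

The `\Q...\E` quoting is a precaution in case a NETID or date ever contains regex metacharacters.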
      Well,
      There are 2 separate files, ECL_STAT and CEL_STAT, and there are 1,200 NETIDs. What I'm trying to do (with maximum speed) is grab all the data for a NETID from ECL_STAT, then split each line found, and perform calculations on certain fields. Then do the same thing to the CEL_STAT file using the same NETID. It is faster if I don't use the second line (the !/^$NETID/ one), but I wanted to show what I want to do. I want it so that, once all the lines for a $NETID (for example 12753pa) are found, those lines are deleted from each array, thus making the arrays smaller and easier to search the next time around.

      I hope this makes some kind of sense

      Thanks



      Dipul
Re: Grep Effeciency
by baku (Scribe) on Feb 28, 2001 at 01:46 UTC

    What scares me (potential-efficiency-wise) is the fact that you loop over the same array for each key. Not knowing the rest of your code, I'm going to take a stab that generating the list for all keys in a single pass down the list would be better...

    To preserve the "remnants" in @ECL_STAT:

    # create a "cache": "is this a NETID?"
    # (if I understand your variables correctly...)
    my %ids = map { $_ => 1 } values %totalnetname;

    my %ECLDATA = ();

    { # lexical scope for leftovers
        my @leftovers = ();
        while (my $entry = shift @ECL_STAT) {
            # ignore non-|-delimited lines
            next unless my ($netid) = split /\|/, $_;
            # push lines which begin with any NETID
            push ( $ids{$netid} ? @{ $ECLDATA{$netid} } : @leftovers ), $_;
        }
        @ECL_STAT = @leftovers;
    }

    for my $key (@totalkeys) {
        my $NETID = $totalnetname{$key};
        my @ECLDATA = @{ $ECLDATA{$NETID} };
        # ... lots, as before.
    }

    If you're reasonably certain that every line begins with a NETID, you could get rid of some of that...

      Hey Baku,
      The reason I loop over the same array is that the "# a lot more code" line is actually taking that array, doing a foreach loop on it, splitting every line at the |, then doing the calculations and printing the results. Is this inefficient? I figure searching just once for the NETID, parsing just that data, then moving on is the best way? But I don't know nearly as much about Perl as many people here :)

      Thanks

      Dipul

        Sorry for my terseness yesterday, that bloody "real world" kept interfering with my PerlMonks time :-)

        Here's a breakdown of what I came up with, with the change that we now split the data only once.

        # create a "cache": "is this a NETID?"
        # (if I understand your variables correctly...)
        #
        # map takes in the list of values of %totalnetname, which
        # I am "assuming" are all of the "interesting" net ID's.
        # In order to look them up quickly, we build a hash by
        # emitting (from inside the {}) key=>value pairs of the
        # form $id => 1.
        #
        # If there were "uninteresting" ID's, now would be the
        # ideal time (IMHO) to "filter" them out. This could be
        # most easily done by changing the BLOCK to something like
        #     { &is_it_interesting($_) ? ( $_ => 1 ) : () }
        #
        # Better yet, never put in the "uninteresting" values
        # (via %totalnetname) to begin with...
        #
        # (As a guess: is %totalnetname perhaps a tied DB_File?
        # That would probably make editing it directly unwise in
        # this situation. If it's not, and the NETIDs are fairly
        # constant, putting it into a DB_File would likely be a
        # good idea, YMMV/TIMTOWTDI/standard disclaimers...)
        #
        my %ids = map { $_ => 1 } values %totalnetname;

        #
        # Clear out the 'output' hash, declaring it lexically
        # (lexically ~~ "my")
        #
        my %ECLDATA = ();

        #
        # This brace starts a lexical scope so that our messy
        # temporary variables are garbage-collected when we're
        # done with them. There's likely a way to do this without
        # temporary variables -- it seems that there always is --
        # but that will have to be [merlyn]'s problem to right ;-)
        #
        { # lexical scope for leftovers
            #
            # Where do bad records go? If the dataset is known (or at
            # least expected) to contain only NETID|data... records,
            # there's no need to have a @leftovers array.
            #
            my @leftovers = ();

            #
            # This is almost a 'for my $entry (@ECL_STAT),' but it
            # isn't. Looking at it again today, I don't remember why
            # it isn't, so it probably could be, or even should be,
            # if only for readability.
            #
            # UPDATED: It's not, because, of course, shift removes its
            # value from @ECL_STAT, leaving it empty for @leftovers.
            # This is marginally unimportant, since @ECL_STAT will be
            # clobbered afterwards anyways, but might decrease memory
            # usage somewhat since each record is in RAM only ~~ 1ce
            # at a time.
            #
            while (my $entry = shift @ECL_STAT) {
                # ignore non-|-delimited lines
                #
                # This 'rejects' any lines which don't contain |'s.
                # Reading 'inner-first:'
                #  1. split $entry on | (note my typo yesterday of $_!)
                #  2. assign (in list context) to $netid and @stuff
                #  3. assignment (in list context) returns the number
                #     of items assigned, so the 'unless' is getting
                #     a number >= 1 (since split will return $entry
                #     if we have no |'s). The missing >1 was another
                #     hasty mistake of mine yesterday :-P
                #  4. go through the loop again if we got 1 (or 0?)
                next unless (my ($netid, @stuff) = split /\|/, $entry) > 1;

                # push lines which begin with any NETID
                #
                # This stores a reference to a new list in either
                # $ECLDATA{$netid} (if the $netid we got, above, is in
                # %ids) or @leftovers (if not). This could be simplified
                # to
                #     push ( @{ $ECLDATA{$netid} } ), [ $netid, @stuff ];
                # if every possible netid is "interesting." (This would
                # also remove the my %ids = map... above)
                #
                # NB. that @{ } is the 'array dereference' operator,
                # which just means 'give me the array which this is a
                # reference to, rather than the reference.' I read it
                # aloud as "the list/array referred to by ..."
                #
                # The use of references is due to the fact that an
                # array cannot be (directly) the 'value' of a hash
                # element.
                #
                push ( $ids{$netid} ? @{ $ECLDATA{$netid} } : @leftovers ),
                    [ $netid, @stuff ];
            }

            #
            # For compatibility with the original script, we push
            # back 'leftovers' into @ECL_STAT; this might be
            # unnecessary.
            #
            @ECL_STAT = @leftovers;

            #
            # At this point, @leftovers goes out of scope and is
            # removed from memory.
            #
        }

        #
        # This routine doesn't need to split the data, or any of
        # that, because it's picking up its data from the @ECLDATA
        # array.
        #
        for my $key (@totalkeys) {
            my $NETID = $totalnetname{$key};

            #
            # Don't try to do any work if no data was found.
            #
            next unless exists $ECLDATA{$NETID};

            #
            # This could also assign to the various 'field'
            # variables you might have, eg.
            #     my ($name, $address, $city) = @{ $ECLDATA{$NETID} };
            #
            my @ECLDATA = @{ $ECLDATA{$NETID} };
            # ... lots, as before.
        }

        Hopefully, that's a bit easier to understand. Good luck in your efforts!

Re: Grep Effeciency
by I0 (Priest) on Feb 28, 2001 at 11:10 UTC
    @NETID{@totalnetname{@totalkeys}} = ();
    @ECLDATA = grep { /^(\w+)\|$datecompare/ && exists $NETID{$1} } @ECL_STAT;
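For readers unfamiliar with the hash-slice idiom I0 uses: `@NETID{LIST} = ()` creates a key in %NETID for every interesting netid in one shot (with undef values, since only `exists` is tested), so a single grep can filter the whole array. A runnable sketch with invented data:

```perl
use strict;
use warnings;

my @totalkeys    = ('site1', 'site2');
my %totalnetname = (site1 => '21352pa', site2 => '55555zz');
my $datecompare  = '02/27/2001';

my @ECL_STAT = (
    '21352pa|02/27/2001|03:30|25|363|',  # interesting netid, right date
    '99887xy|02/27/2001|03:30|11|120|',  # uninteresting netid
    '55555zz|02/26/2001|03:30|11|120|',  # interesting netid, wrong date
);

# Hash slice: one key per interesting netid, values all undef.
my %NETID;
@NETID{ @totalnetname{@totalkeys} } = ();

# One grep over the whole array: keep lines whose leading field is
# a known netid AND whose date field matches.
my @ECLDATA =
    grep { /^(\w+)\|\Q$datecompare\E/ && exists $NETID{$1} } @ECL_STAT;

print scalar(@ECLDATA), "\n";   # prints 1 for this sample data
```

Note this selects the matching lines for all interesting netids at once; they would still need to be grouped (e.g. into a hash of arrays, as in the other replies) before per-NETID calculations.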
Re: Grep Effeciency
by ImpalaSS (Monk) on Feb 28, 2001 at 03:20 UTC
    Well,
    With the help of everyone here,
    I learned how to use hashes a LOT more, and the script now does everything it needs to do in less than 2 minutes.

    That's amazing, thanks to everyone!

    Dipul