mdunnbass has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks,

First and foremost, I wanted to express some huge thanks to BrowserUK, ikegami, xdg, gaal, and the many others who have offered help and given me great advice this past week. It's helped me tremendously.

So, from my previous post, I am trying to make a cumbersome and slow script run more quickly and efficiently. I have done a bunch of tuning, and the entirety of the script runs great. The only snag is that 1 subroutine is taking over 480 seconds to run, and I would love to drop that runtime, obviously.

The short version is that it is taking an enormous HoHoA (H1 -> ~22k key/value pairs, H2 -> ~5 key/value pairs each, A -> ~500 elements each), and parsing it into an HoAoHoA, based on the values of the A elements and a few global variables.

I've posted my original (working) sub behind the first readmore below, and a second (also working) version of the same sub behind the second readmore. Based on some friendly advice from BrowserUK, I tried tightening up the sub, and the result was the second version. Although it works, changing out a series of loops for a grep call wound up costing me an extra 400 seconds, almost doubling the runtime of the sub.

If anyone is willing to help me figure out why the sub is taking forever, and would be willing to help me streamline it, I'd greatly appreciate it.

This code has a large number of comments in it, which will hopefully allow everyone to understand what it is I am doing.
############# Begin WEED subroutine ##############
####
# This subroutine sifts through the arrays of all the matches to the
# sites found in SEARCHFASTA, and organizes them into 'sets' corresponding
# to each hit and all of the other hits occurring within $span of it.
# The format of the incoming hash, %matches, is a HoHoA.  The breakdown
# is as follows: Initial hash keys are FASTA_ID lines.  The Arrays stored
# as the values for these keys are hashes.  The number of key/value pairs
# corresponds to the number of user input search patterns.  The values
# for these keys are arrays, which contain the position of m//g matches
# within that FASTA_ID for that search pattern.
# The output of the subroutine will be a more complex structure.  We will
# be splitting the subhashes into arrays of (identically organized) subhashes.
# We want to gather together all groups of m//g hits (of all the patterns)
# that occur within $span distance of each other, keeping their hash
# organization.  Each group will correspond to an array element.
####
sub WEED {
    my ($span, %matches) = @_;
    my ($site, $fasta, $sitekey) = '';
    my ($setscounter, $lowerlimit, $upperlimit, $set, $yes) = 0;
    my %sets = ();
    my @newfast;
    my $i = 0;
    while ($i < $num) {
        $sortedKeys[$i] = $i;
        $i++;
    }

    # Rather than the unsorted method of (keys %matches), I iterate through
    # the fastheaders array, which stores the fasta_ids of interest, in their
    # order within the file.  This array will need updating at the end of this
    # subroutine
    for my $fasta_id (@fastheaders) {
        $setscounter = 0;

        # We only want to keep groups that contain matches for all patterns
        next unless defined %{$matches{$fasta_id}};

        # keep the user updated on program progression
        print "Currently analysing results from sequence:\n  $fasta_id\n";

        # The key $site below is an index value (0, 1, 2, ..), so incrementing
        # in numerical order makes the most sense.  It corresponds to the user-
        # input patterns that were searched for
        for my $site (@sortedKeys) {

            # We only want to keep groups that contain matches for all patterns
            last unless @{$matches{$fasta_id}{$site}};
            $i = 0;

            # We are looking for all matches within $span of EVERY match, so
            # for each match, find its position, and the position $span away
            WEEDLOW: for my $low (@{$matches{$fasta_id}{$site}}) {
                $lowerlimit = $low + 0;
                $upperlimit = $span + $lowerlimit;

                # Now, any and all matches between $lowerlimit and $upperlimit
                # should be saved into an %HoA
                for my $sitekey (@sortedKeys) {
                    next unless defined @{$matches{$fasta_id}{$sitekey}};
                    my @arrayA = ();

                    # this part should be self-explanatory, though it is the
                    # core of the sub.  For each element, where is it located?
                    # If it is below the lower limit (only occurring when
                    # $lowerlimit is defined by a different $sitekey) then
                    # proceed to the next.  If it is above the upperlimit, then
                    # there won't be any more within the array, so skip to the
                    # next sitekey
                    for my $hit (@{$matches{$fasta_id}{$sitekey}}) {
                        next unless ($hit >= $lowerlimit);
                        last unless ($hit <= $upperlimit);

                        # the only reason for the $ggg is that I was having
                        # referencing problems, and this seemed to alleviate
                        # those.  I do not understand why.
                        my $ggg = $hit + 0;
                        push(@arrayA, $ggg);
                        $ggg = 0;
                        next;
                    }

                    # Now, if we saved matches to an array above, then we want
                    # to keep them.  $setscounter keeps track of how many sets
                    # there are, with the sets being subhashes containing
                    # arrays of hits from ALL sitekeys.
                    if (@arrayA) {
                        @{$sets{$fasta_id}[$setscounter]{$sitekey}} = @arrayA;
                        @arrayA = ();
                    }
                    # However, if we did not save any matches, then this whole
                    # set ($setscounter) is pointless and should be discarded.
                    else {
                        %{$sets{$fasta_id}[$setscounter]} = ();
                        next WEEDLOW;
                    }
                }

                # a possibly redundant check to make sure we have matches from
                # ALL patterns
                for my $checkhash (keys %{$sets{$fasta_id}[$setscounter]}) {
                    unless (defined $sets{$fasta_id}[$setscounter]{$checkhash}[0]) {
                        %{$sets{$fasta_id}[$setscounter]} = ();
                        last;
                    }
                }

                # and another...
                if (scalar keys %{$sets{$fasta_id}[$setscounter]} < $num) {
                    %{$sets{$fasta_id}[$setscounter]} = ();
                }

                # Finally, if we are sure this is a hit, then increment
                # $setscounter to get the next set
                $setscounter++ if scalar %{$sets{$fasta_id}[$setscounter]};
            } # closes for my low
        } # closes for my site

        # Originally, I was getting arrays of empty subhashes, since
        # $setscounter was being initialized, though never having a subhash
        # assigned to it.  Therefore, I threw this in.  If there is a subhash
        # present as the last element of the array, the scalar call will come
        # back with the bits being used (1/8, etc).  Otherwise, it will return
        # 0.  If it's 0, we pop that out, and there's no array remaining.
        if (@{$sets{$fasta_id}}) {
            pop @{$sets{$fasta_id}}
                unless scalar %{$sets{$fasta_id}[$#{$sets{$fasta_id}}]};
        }

        # this will reset the @fastarray array, since we have possibly removed
        # fasta_id keys from our hash of matches.
        if (@{$sets{$fasta_id}}) {
            push(@newfast, $fasta_id);
        }
        # similarly, we want to get rid of empty hash elements.
        else {
            delete $sets{$fasta_id};
        }
    } # closes for fastarray

    @fastarray = ();
    # as @fastarray is global, this will have global effect, and we do not
    # need to return it.
    @fastarray = @newfast;
    return %sets;
} # closes subroutine
############# End WEED subroutine ################

And here is the Second sub: BrowserUK took out a bunch of the leftover useless variable declarations, and removed a few things here and there. The biggest change, and the one that resulted in the huge increase in time, was replacing my

for my $hit (@{$matches{$fasta_id}{$sitekey}}) {
    next unless ($hit >= $lowerlimit);
    last unless ($hit <= $upperlimit);

sections with a grep routine instead. So, apparently, that wasn't a great idea, though it sounded perfectly reasonable.
Anyway, the code is below:
sub WEED {
    my ($span, %matches) = @_;
    my %sets;
    my @newfast;
    my @sortedKeys;
    my $i = 0;
    while ($i < $num) {
        $sortedKeys[$i] = $i;
        $i++;
    }

    # Rather than the unsorted method of (keys %matches), I iterate through
    # the fastheaders array, which stores the fasta_ids of interest, in their
    # order within the file.  This array will need updating at the end of this
    # subroutine
    WEEDFAST: for my $fasta_id (@fastheaders) {
        my $setscounter = 0;

        # We only want to keep groups that contain matches for all patterns
        next unless defined %{$matches{$fasta_id}};

        # keep the user updated on program progression
        print "Currently analysing results from sequence:\n  $fasta_id\n";

        # The key $site below is an index value (0, 1, 2, ..), so incrementing
        # in numerical order makes the most sense.  It corresponds to the user-
        # input patterns that were searched for
        for my $site (@sortedKeys) {

            # We only want to keep groups that contain matches for all patterns
            last unless defined @{$matches{$fasta_id}{$site}};

            # We are looking for all matches within $span of EVERY match, so
            # for each match, find its position, and the position $span away
            WEEDLOW: for my $low (@{$matches{$fasta_id}{$site}}) {
                my $lowerlimit = $low + 0;
                my $upperlimit = $span + $lowerlimit;

                # Now, any and all matches between $lowerlimit and $upperlimit
                # should be saved into an %HoA
                for my $sitekey (@sortedKeys) {
                    next unless defined @{$matches{$fasta_id}{$sitekey}};

                    # this part should be self-explanatory, though it is the
                    # core of the sub.  For each element, where is it located?
                    # If it is below the lower limit (only occurring when
                    # $lowerlimit is defined by a different $sitekey) then
                    # proceed to the next.  If it is above the upperlimit, then
                    # there won't be any more within the array, so skip to the
                    # next sitekey
                    @{$sets{$fasta_id}[$setscounter]{$sitekey}} =
                        grep { $_ >= $lowerlimit and $_ <= $upperlimit }
                            @{$matches{$fasta_id}{$sitekey}};
                    unless (@{$sets{$fasta_id}[$setscounter]{$sitekey}}) {
                        %{$sets{$fasta_id}[$setscounter]} = ();
                        next WEEDLOW;
                    }
                }

                # a possibly redundant check to make sure we have matches from
                # ALL patterns
                if (scalar keys %{$sets{$fasta_id}[$setscounter]} < $num) {
                    %{$sets{$fasta_id}[$setscounter]} = ();
                }

                # Finally, if we are sure this is a hit, then increment
                # $setscounter to get the next set
                $setscounter++ if scalar %{$sets{$fasta_id}[$setscounter]};
            } # closes for my low
        } # closes for my site

        # Originally, I was getting arrays of empty subhashes, since
        # $setscounter was being initialized, though never having a subhash
        # assigned to it.  Therefore, I threw this in.  If there is a subhash
        # present as the last element of the array, the scalar call will come
        # back with the bits being used (1/8, etc).  Otherwise, it will return
        # 0.  If it's 0, we pop that out, and there's no array remaining.
        unless (defined @{$sets{$fasta_id}}) {
            delete $sets{$fasta_id};
            next WEEDFAST;
        }
        if (@{$sets{$fasta_id}}) {
            pop @{$sets{$fasta_id}}
                unless scalar %{$sets{$fasta_id}[$#{$sets{$fasta_id}}]};
        }

        # this will reset the @fastarray array, since we have possibly removed
        # fasta_id keys from our hash of matches.
        if (@{$sets{$fasta_id}}) {
            push(@newfast, $fasta_id);
        }
        # similarly, we want to get rid of empty hash elements.
        else {
            delete $sets{$fasta_id};
        }
    } # closes for fastarray

    # as @fastarray is global, this will have global effect, and we do not
    # need to return it.
    @fastarray = @newfast;
    return %sets;
} # closes subroutine
Thanks everybody
Matt

Re: Help tightening up a subroutine please
by gaal (Parson) on Jan 23, 2007 at 18:25 UTC
    First, grep goes over the whole list. If you only need to stop at the first successful match, don't use it. You can use List::Util::first instead.

    Other than that, just a blitz look.

    > while ($i < $num) {
    >     $sortedKeys[$i] = $i;
    >     $i++;
    > }

    This can be written simply as

    @sortedKeys = (0 .. $num - 1);

    > next unless defined %{$matches{$fasta_id}};
    > last unless defined @{$matches{$fasta_id}{$site}};

    These may not always be true, but they autovivify an element so *will* be true the second time over you test them with the same key.

    > # as @fastarray is global, this will have global effect, and we do not need to return it.
    >
    > @fastarray=@newfast;

    If the array is big, you'd do much better just updating a reference to it rather than the above, as you are copying all the elements. Avoiding globals wherever you can is generally a good idea...
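    The copy-versus-rebind point can be sketched outside Perl too. The Python below is purely an illustration (the function and key names are invented for the example); in the Perl code it would correspond to passing around a reference like `\@newfast` instead of copying every element with `@fastarray = @newfast`:

```python
def replace_by_copy(target, new_items):
    # O(n): every element of new_items is copied into the existing list,
    # analogous to @fastarray = @newfast in the Perl code.
    target[:] = new_items

def replace_by_rebind(state, new_items):
    # O(1): the container simply holds a reference to new_items now;
    # no elements are copied, analogous to keeping \@newfast around.
    state["fastarray"] = new_items

big = list(range(100_000))
state = {"fastarray": [1, 2, 3]}
replace_by_rebind(state, big)   # constant time regardless of len(big)
```

    The larger the array, the more the rebinding approach saves, since the per-element copy disappears entirely.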

    Update: Most of my points contained mistakes: see below for corrections by kyle++ and BrowserUk++. That'd teach me to post when tired. The advice against globals is probably sound though. :-)

      These may not always be true, but they autovivify an element so *will* be true the second time over you test them with the same key.

      I don't think that's true.

      use Data::Dumper;

      my %x;
      print "one\n" if defined %{$x{a}};
      print "two\n" if defined %{$x{a}};
      print Dumper( \%x );

      It does autovivify, but it does so to a value that's still not defined (when tested that way, anyway). Full output:

      $VAR1 = { 'a' => {} };

      UPDATE: I should add that I do not think this is a good way to do this check. It would be better to check that the key exists and then, if you need to know that there's something in there, dereference the array and check that.

        Huh. You sent me to the manual since this greatly surprised me. FWIW you're right—thanks for the correction—but some of what I did remember does hold:

          Use of "defined" on aggregates (hashes and arrays) is deprecated. It used to report whether memory for that aggregate has ever been allocated. This behavior may disappear in future versions of Perl. You should instead use a simple test for size:

          if (@an_array) { print "has array elements\n" }
          if (%a_hash)   { print "has hash members\n"   }

          When used on a hash element, it tells you whether the value is defined, not whether the key exists in the hash. Use "exists" for the latter purpose.

        Update: Bah, I should heed my own advice. The second statement never gets a chance to execute.

        You're looking at the statements in isolation. Look at them as a pair.

        Original post: (WRONG!)

        # First time
        next unless defined %{$matches{$fasta_id}};        # False
        last unless defined @{$matches{$fasta_id}{$site}}; # False

        # Second time
        next unless defined %{$matches{$fasta_id}};        # Now true!!
        last unless defined @{$matches{$fasta_id}{$site}};

        What really happens:

        # First time
        next unless defined %{$matches{$fasta_id}};        # False -> next
        last unless defined @{$matches{$fasta_id}{$site}}; # Skipped by next

        # Second time
        next unless defined %{$matches{$fasta_id}};        # Still false
        last unless defined @{$matches{$fasta_id}{$site}};
      First, grep goes over the whole list. If you only need to stop at the first successful match, don't use it. You can use List::Util::first instead.

      For this particular application, a bandpass filter rather than a low pass filter, List::Util::first is no substitute for grep.

      ( Actually, first() isn't even a low pass filter as it returns a single value not a filtered list but...)


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Help tightening up a subroutine please
by BrowserUk (Patriarch) on Jan 23, 2007 at 18:36 UTC

    Hmm. Do you remember this?

    There is a possible caveat with this change! Your explicit loop does allow you to short circuit the loop at the high pass limit, which isn't possible with grep. However, as shown above, using grep does allow you to avoid the creation of and later copying of the temporary array. This, combined with grep's inherent better performance, may outweigh the benefit of that short circuiting. Or it may not.

    The only way to tell will be to make the change and benchmark. If the size of the array being filtered is sufficiently large, and the pass band sufficiently narrow and low down in the range that the short circuit is beneficial, then it would be better to use a binary search algorithm to discover the low and high range limits and then copy the values across to the results set using an array slice.

    Try substituting this for the grep.

    my $aref = $matches{ $fasta_id }{ $sitekey };
    my( $lo, $hi ) = ( 0, scalar @{ $aref } );
    ++$lo while $aref->[ $lo + 1 ] < $lowerlimit;
    --$hi while $aref->[ $hi - 1 ] > $upperlimit;
    @{ $sets{ $fasta_id }[ $setscounter ]{ $sitekey } } = @{ $aref }[ $lo .. $hi ];

    If that improves your performance, then using a binary search (as mentioned before) to find the lower and upper bounds should improve it further. Though binary searches coded in Perl aren't always as effective as you might hope.


      So, I have implemented this, with the following minor change (to account for boundary conditions):
      my $aref = $matches{ $fasta_id }{ $sitekey };
      my( $lo, $hi ) = ( 0, scalar @{ $aref } );
      ++$lo while ($aref->[ $lo + 1 ] < $lowerlimit && $lo < $hi-2);
      --$hi while ($aref->[ $hi - 1 ] > $upperlimit && $hi > 0 );

      I would love to post the timing results from dprofpp, but after 25 minutes, it is still running. And, since I have a print statement in there for just this purpose, I know that it is not hung up, the program is still running as intended. So, the binary selector in this case is even slower than the first 2 alternatives.

      Back to the drawing board, I guess. I'd really love to make this sub run much much faster.

      Hey, here's a random thought.... Is there a way to quickly zoom in to a location in an array that your variable's value is near? Like:

      @a = ( 1 .. 100 );
      $val = 43.56;
      $nearby = zoom($val, @a);
      # $nearby now is essentially an index for @a somewhere in the near
      # vicinity of 43.56, say 40 or so..
      That would be great.... Oh well.
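      For what it's worth, that "zoom" is exactly a binary search over an already-sorted array. A minimal sketch of the idea follows, in Python only because its standard bisect module implements the chop; the zoom name and the ( 1 .. 100 ) data mirror the hypothetical above, and a Perl version would be a hand-rolled chop like the bandpass subs elsewhere in this thread:

```python
import bisect

def zoom(val, a):
    # Binary search over a sorted list: returns the index of the first
    # element >= val, landing in the near vicinity of val in O(log n).
    return bisect.bisect_left(a, val)

a = list(range(1, 101))   # the ( 1 .. 100 ) array from the example
val = 43.56
nearby = zoom(val, a)     # index 43 here: a[43] == 44, just above 43.56

# The same idea yields the whole passband in one step, which is what the
# inner WEED loop is really after:
def bandpass(a, lowerlimit, upperlimit):
    lo = bisect.bisect_left(a, lowerlimit)    # first element >= lowerlimit
    hi = bisect.bisect_right(a, upperlimit)   # one past last element <= upperlimit
    return a[lo:hi]
```

      Two chops plus a slice, no per-element scan, and duplicates are handled naturally.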

      Thanks,
      Matt

        You could try profiling with this. It does a bastardized binary chop on both ends to locate the passband and then steps back to locate the limits:

        sub bandpass {
            my( $aref, $loValue, $hiValue ) = @_;
            return if $loValue > $aref->[-1] or $hiValue < $aref->[0];
            my( $lo, $hi ) = ( 1, $#{ $aref } );
            $lo += $lo while $aref->[$lo] < $loValue;
            --$lo while $lo and $aref->[$lo-1] >= $loValue;
            $hi >>= 1 while $aref->[$hi] > $hiValue;
            ++$hi while $hi < $#$aref and $aref->[$hi+1] <= $hiValue;
            return @{ $aref }[ $lo .. $hi ];
        }

        ...

        my $aref = $matches{ $fasta_id }{ $sitekey };
        $sets{ $fasta_id }[ $setscounter ]{ $sitekey }
            = [ bandpass( $aref, $lowerlimit, $upperlimit ) ];

        The bastardization is designed to handle non-unique and/or missing values in the set and still locate the appropriate boundaries. If you know your values will always be unique, that could simplify things a little. If you knew that the limits ($lowerlimit & $upperlimit) would always be contained within the set, that would simplify things considerably, but from what you've said elsewhere that is not the case.


        Update: Further testing with more realistic test data showed the above to be both inefficient and flawed.

        The problem with optimising this is the range of possible variations. Optimising for the general case is likely to do very badly for particular cases. Your original solution works well for narrow bands that are low in the range:

        min---------loPass--hiPass------------------------------------max

        But will perform badly for either of the following, where grep will perform quite well:

        min-------------------------------------------loPass--hiPass--max min--loPass-------------------------------------------hiPass--max

        For the latter of these two, scanning in from either end will be optimal, as with my crude linear solution, but from what you've said about the datasets, it seems that your pass bands are quite narrow, but can be located anywhere within the entire range. And that is the hardest scenario to optimise for.

        Here is another version that I believe to be both correct; and tailored towards being efficient for the type of data you've indicated you are working with. The revelation came to me that instead of using two binary chops to locate the two ends of the pass band--which is quite expensive as your limits may not appear in the dataset--I could use a single binary chop to locate a (single) value that lies within the passband and then use two linear searches out from that value to locate the lower and upper limits. If your pass bands are narrow, these linear searches will have little to do. They will also neatly handle the possibility of duplicate values which complicates binary chops.

        sub bandpass {
            my( $aref, $loValue, $hiValue ) = @_;
            return if $loValue > $aref->[-1] or $hiValue < $aref->[0];
            my( $lo, $hi ) = ( 0, $#{ $aref } );
            while( $lo < $hi ) {
                my $mid = int( ( $lo + $hi ) / 2 );
                if( $aref->[ $mid ] >= $loValue and $aref->[ $mid ] <= $hiValue ) {
                    $lo = $hi = $mid;
                    last;
                }
                elsif( $aref->[ $mid ] < $loValue ) {
                    $lo = $mid + 1;
                }
                elsif( $aref->[ $mid ] > $hiValue ) {
                    $hi = $mid - 1;
                }
            }
            return if $loValue > $aref->[ $hi ] or $hiValue < $aref->[ $lo ];
            --$lo while $lo and $aref->[ $lo - 1 ] >= $loValue;
            ++$hi while $hi < $#{ $aref } and $aref->[ $hi + 1 ] <= $hiValue;
            return @{ $aref }[ $lo .. $hi ];
        }

        Again, no guarantees. It's very hard to do this kind of optimisation without real datasets to benchmark against.


        I would choose Benchmark to do the comparison and go to bed if I were you...

        -Paul

Re: Help tightening up a subroutine please
by radiantmatrix (Parson) on Jan 23, 2007 at 20:15 UTC

    I am willing to bet your grep is the slowest bit, but have you profiled the code? Anyhow, assuming the grep bit is your bottleneck:

    ## your original code
    @{$sets{$fasta_id}[$setscounter]{$sitekey}} =
        grep { $_ >= $lowerlimit and $_ <= $upperlimit }
            @{$matches{$fasta_id}{$sitekey}};

    Depending on your data, you may gain some performance by using a sorted array instead:

    @{$sets{$fasta_id}[$setscounter]{$sitekey}} = ();
    foreach ( sort { $a <=> $b } @{$matches{$fasta_id}{$sitekey}} ) {
        last if $_ > $upperlimit;
        next if $_ < $lowerlimit;
        push @{$sets{$fasta_id}[$setscounter]{$sitekey}}, $_;
    }

    This will only have an advantage if sort is sufficiently fast and there are many elements above $upperlimit, so this suggestion comes with an extreme case of "know thy data".

    <radiant.matrix>
    Ramblings and references
    The Code that can be seen is not the true Code
    I haven't found a problem yet that can't be solved by a well-placed trebuchet
      Your code here:
      @{$sets{$fasta_id}[$setscounter]{$sitekey}} = ();
      foreach ( sort @{$matches{$fasta_id}{$sitekey}} ) {
          last if $_ > $upperlimit;
          next if $_ < $lowerlimit;
          push @{$sets{$fasta_id}[$setscounter]{$sitekey}}, $_;
      }

      Looks a lot like my original code, in the first bit, here:
      for my $hit (@{$matches{$fasta_id}{$sitekey}}) {
          next unless ($hit >= $lowerlimit);
          last unless ($hit <= $upperlimit);
          my $ggg = $hit + 0;
          push (@arrayA, $ggg);
      }

      And indeed, it does have a speed advantage. I don't need the sort, as the elements are already sorted in numerical order, so I can (and do) leave that out. As for knowing thy data, this here's the tricky part. I am going through the numbers in the arrays, and basically setting aside clusters of EVERY group of numbers $span distance apart. Starting from the lowest element, and proceeding all the way to the highest. So, how many elements there are above $upperlimit is as variable as possible. It goes from all to none.

      Thanks for the input, though.
      Matt

        Hm, somehow I missed that. Given the extra bit of data you supplied, I might recommend the use of List::MoreUtils' firstidx and lastidx functions. The fact that it's XS ought to help with the speed.

        use List::MoreUtils qw(firstidx lastidx);

        # for clarity...
        my $result;
        my $src = $matches{$fasta_id}{$sitekey};

        # Use a slice and firstidx/lastidx to copy the range of values
        @newArray = @$src[ (firstidx {$_ >= $lowerlimit} @$src)
                           .. (lastidx {$_ <= $upperlimit} @$src) ];
        $result = \@newArray;
Re: Help tightening up a subroutine please
by jwkrahn (Abbot) on Jan 24, 2007 at 01:19 UTC
    24    next unless defined %{$matches{$fasta_id}};
    38    last unless defined @{$matches{$fasta_id}{$site}};
    51    next unless defined @{$matches{$fasta_id}{$sitekey}};
    91    unless (defined @{$sets{$fasta_id}}) {
    From the man page for defined:
    Use of "defined" on aggregates (hashes and arrays) is deprecated.  It used
    to report whether memory for that aggregate has ever been allocated.  This
    behavior may disappear in future versions of Perl.  You should instead use
    a simple test for size:

        if (@an_array) { print "has array elements\n" }
        if (%a_hash)   { print "has hash members\n"   }

Re: Help tightening up a subroutine please
by eric256 (Parson) on Jan 24, 2007 at 04:37 UTC

    I'm having kind of a difficult time visualizing your data, but it looks like you are scanning over @sortedKeys multiple times. Could you rearrange the logic to scan outward from the current item instead of continually scanning over @sortedKeys?

    If you could provide a small bit of code that actually uses this sub, giving it data and showing us the expected outcome it might be a fun challenge for some of us! ;) It could be a significantly smaller set of data than you are working with, just enough to show the input/output is working correctly.

    Good luck!


    ___________
    Eric Hodges
      That's a very good point. Words can never describe things as well as what you can manipulate by hand. So, here's a very small sample input, with what the desired output should be. I hope this helps anyone interested in helping me out....

      Thanks to everyone who's responded so far. This has been informative.
      Matt