comment on

Greetings Monks,

First and foremost, I wanted to express some huge thanks to BrowserUK, ikegami, xdg, gaal, and the many others who have offered help and given me great advice this past week. It's helped me tremendously.

So, from my previous post, I am trying to make a cumbersome and slow script run more quickly and efficiently. I have done a bunch of tuning, and the entirety of the script runs great. The only snag is that 1 subroutine is taking over 480 seconds to run, and I would love to drop that runtime, obviously.

The short version is that it is taking an enormous HoHoA (H1 -> ~22k key/value pairs, H2 -> ~5 key/value pairs each, A -> ~500 elements each), and parsing it into an HoAoHoA, based on the values of the A elements and a few global variables.

I've posted my original (working) sub behind the first readmore below, and a second (also working) version of the same sub behind the second readmore. Based on some friendsly advice from BrowserUK, I tried tightening up the sub, and the result was the second version. Although it works, changing out a series of loops for a grep call wound up costing me an extra 400 seconds, almost doubling the runtime of the sub.

If anyone is willing to help me figure out why the sub is taking forever, and would be willing to help me streamline it, I'd greatly appreciate it.

This code has a large quantity of comments in it, hopefully, allowing everyone to understand more what it is I am doing.

############# Begin WEED subroutine ##############      
      
####
#  This subroutine sifts through the arrays of all the matches to the 
#  sites found in SEARCHFASTA, and organizes them into 'sets' correspo
+nding
#  to each hit and all of the other hits occurring within $span of it.
#  The format of the incoming hash, %matches, is a HoHoA.  The breakdo
+wn
#  is as follows:  Initial hash keys are FASTA_ID lines.  The Arrays s
+tored
#  as the values for these keys are hashes.  The number of key/value p
+airs
#  corresponds to the number of user input search patterns.  The value
+s 
#  for these keys are arrays, which contain the position of m//g match
+es
#  within that FASTA_ID for that search pattern.
#  The output of the subroutine will be a more complex structure.  We 
+will
#  be splitting the subhashes into arrays of (identically organized) s
+ubhashes.
#  We want to gather together all groups of m//g hits (of all the patt
+erns)
#  that occur within $span distance of each other, keeping their hash
#  organization.  Each group will correspond to an array element.
####

sub WEED {

 my ($span, %matches) = @_;
 my ($site, $fasta, $sitekey) = '';
 my ($setscounter,$lowerlimit,$upperlimit,$set,$yes) = 0;
 my %sets = ();
 my @newfast;
 
 my $i = 0;
 
 while ($i < $num) {
   $sortedKeys[$i] = $i;
   $i++;
   }

 # Rather than the unsorted method of (keys %matches), I iterate throu
+gh
 # the fastheaders array, which stores the fasta_ids of interest, in t
+heir
 # order within the file. This array will need updating at the end of 
+this 
 # subroutine
 for my $fasta_id (@fastheaders) {
  $setscounter = 0;
  
  # We only want to keep groups that contain matches for all patterns
  next unless defined %{$matches{$fasta_id}};
  
  # keep the user updated on program progression
  print "Currently analysing results from sequence:\n     $fasta_id\n"
+;
   
   #  The key $site below is an index value (0, 1, 2, ..), so incremen
+ting
   #  in numerical order makes the most sense.  It corresponds to the 
+user-
   #  input patterns that were searched for
   for my $site ( @sortedKeys ) {

     #  We only want to keep groups that contain matches for all patte
+rns
     last unless @{$matches{$fasta_id}{$site}};
     $i = 0;
    
     #  We are looking for all matches within $span of EVERY match, so
+ 
     #  for each match, find it's position, and the position $span awa
+y
          
     WEEDLOW: for my $low (@{$matches{$fasta_id}{$site}}) {
       $lowerlimit = $low + 0;
       $upperlimit = $span + $lowerlimit;
    
       #  Now, any and all matches between $lowerlimit and $upperlimit
+ should be 
       #  saved into an %HoA
       
       for my $sitekey (  @sortedKeys ) {
          next unless defined @{$matches{$fasta_id}{$sitekey}};
          my @arrayA = ();
    
       # this part should be self-explanatory, though it is the core o
+f the
       # sub.  For each element, where is it located?  If it is below 
+the lower
       # limit (only occurring when $lowerlimit defined by a different
+ $sitekey)
       # then proceed to the next.  If it is above the upperlimit, the
+n there
       # won't be any more within the array, so skip to next sitekey
      
          for my $hit (@{$matches{$fasta_id}{$sitekey}}) {
            next unless ($hit >= $lowerlimit);
            last unless ($hit <= $upperlimit);
     
       # the only reason for the $ggg is that I was having referencing
+ problems, and
       # this seemed to alleviate those.  I do not understand why.
       
            my $ggg = $hit + 0;
            push (@arrayA, $ggg);
            $ggg = 0;
            next;
            }
       # Now, if we saved matches to an array above, then we want to k
+eep them.    
       # setcounter keeps track of how many sets there are, with the s
+ets being
       # subhashes containing arrays of hits from ALL sitekeys.
          if (@arrayA) {
            @{$sets{$fasta_id}[$setscounter]{$sitekey}} = @arrayA;
            @arrayA = ();
            }
       # However, if we did not save any matches, then this whole set 
+($setscounter)
       # is pointless and should be discarded.
            else {
              %{$sets{$fasta_id}[$setscounter]} = ();
              next WEEDLOW;
              } 
          } 
 
       # a possibly redundant check to make sure we have matches from 
+ALL patterns
    
       for my $checkhash (keys %{$sets{$fasta_id}[$setscounter]}) {
         unless (defined $sets{$fasta_id}[$setscounter]{$checkhash}[0]
+) {
           %{$sets{$fasta_id}[$setscounter]} = ();
           last;
           }
         }
       # and another...
       
       if (scalar keys %{$sets{$fasta_id}[$setscounter]} < $num) {
         %{$sets{$fasta_id}[$setscounter]} = ();
         } 
       # Finally, if we are sure this is a hit, then increment setscou
+nter
       # to get the next set 
       $setscounter++ if scalar %{$sets{$fasta_id}[$setscounter]};
         
       # closes for my low
       }
    #closes for my site
    }
  
  # Originally, I was getting arrays of empty subhashes, since $setsco
+unter was
  # being initialized, though never having a subhash assignedto it.  T
+herefore, 
  # I threw this in.  If there is a subhash present as the last elemen
+t of the array,
  # the scalar call will come back with the bits being used (1/8, etc)
+.  Otherwise,
  # it will return 0.  If it's 0, we pop that out, and there's no arra
+y remaining.

  if (@{$sets{$fasta_id}}) {
    pop @{$sets{$fasta_id}} unless scalar %{$sets{$fasta_id}[$#{$sets{
+$fasta_id}}]};
    }

# this will reset the @fastarray array, since we have possibly removed
+ fasta_id keys from
# our hash of matches.

  if (@{$sets{$fasta_id}}) {
    push(@newfast,$h);
    }
  # similarly, we want to get rid of empty hash elements.  
  else {
    delete $sets{$fasta_id};
    }
  
  #closes for fastarray
  }
 @fastarray=();

# as @fastarray is global, this will have global effect, and we do not
+ need to return it.

 @fastarray=@newfast;
 return %sets;
 #closes subroutine
 }

############# End WEED subroutine ################
[download]

And here is the Second sub: BrowserUK took out a bunch of the leftover useless variable declarations, and removed a few things here and there. The biggest change, and the one that resulted in the huge increase in time, was replacing my

          for my $hit (@{$matches{$fasta_id}{$sitekey}}) {
            next unless ($hit >= $lowerlimit);
            last unless ($hit <= $upperlimit);
[download]

sections with a grep routine instead. So, apparently, that wasn't a great idea. THough it sounded perfectly reasonable.

Anyway, the code is below:

sub WEED {

 my ($span, %matches) = @_;
 my %sets;
 my @newfast;
 my @sortedKeys;
 my $i = 0;
 
 while ($i < $num) {
   $sortedKeys[$i] = $i;
   $i++;
   }
 
 # Rather than the unsorted method of (keys %matches), I iterate throu
+gh
 # the fastheaders array, which stores the fasta_ids of interest, in t
+heir
 # order within the file. This array will need updating at the end of 
+this 
 # subroutine

 WEEDFAST: for my $fasta_id (@fastheaders) {
  my $setscounter = 0;
  
  # We only want to keep groups that contain matches for all patterns
 
  next unless defined %{$matches{$fasta_id}};
  
  # keep the user updated on program progression
 
  print "Currently analysing results from sequence:\n     $fasta_id\n"
+;
   
   #  The key $site below is an index value (0, 1, 2, ..), so incremen
+ting
   #  in numerical order makes the most sense.  It corresponds to the 
+user-
   #  input patterns that were searched for
 
   for my $site ( @sortedKeys ) {

     #  We only want to keep groups that contain matches for all patte
+rns
  
     last unless defined @{$matches{$fasta_id}{$site}};
     
     #  We are looking for all matches within $span of EVERY match, so
+ 
     #  for each match, find it's position, and the position $span awa
+y
          
     WEEDLOW: for my $low (@{$matches{$fasta_id}{$site}}) {
       my $lowerlimit = $low + 0;
       my $upperlimit = $span + $lowerlimit;
    
       #  Now, any and all matches between $lowerlimit and $upperlimit
+ should be 
       #  saved into an %HoA
       
       for my $sitekey (  @sortedKeys ) {
          next unless defined @{$matches{$fasta_id}{$sitekey}};
    
       # this part should be self-explanatory, though it is the core o
+f the
       # sub.  For each element, where is it located?  If it is below 
+the lower
       # limit (only occurring when $lowerlimit defined by a different
+ $sitkey)
       # then proceed to the next.  If it is above the upperlimit, the
+n there
       # won't be any more within the array, so skip to next sitekey
      
          @{$sets{$fasta_id}[$setscounter]{$sitekey}} = grep {
            $_ >= $lowerlimit and $_ <= $upperlimit
            }  @{$matches{$fasta_id}{$sitekey}};
          
          unless (@{$sets{$fasta_id}[$setscounter]{$sitekey}}) {
            %{$sets{$fasta_id}[$setscounter]} = ();
            next WEEDLOW;
            }
          } 
 
       # a possibly redundant check to make sure we have matches from 
+ALL patterns
       
       if (scalar keys %{$sets{$fasta_id}[$setscounter]} < $num) {
         %{$sets{$fasta_id}[$setscounter]} = ();
         } 
 
       # Finally, if we are sure this is a hit, then increment setscou
+nter
       # to get the next set 
       
       $setscounter++ if scalar %{$sets{$fasta_id}[$setscounter]};
         
       # closes for my low
       }
    #closes for my site
    }
  
  # Originally, I was getting arrays of empty subhashes, since $setsco
+unter was
  # being initialized, though never having a subhash assigned to it.  
+Therefore, 
  # I threw this in.  If there is a subhash present as the last elemen
+t of the array,
  # the scalar call will come back with the bits being used (1/8, etc)
+.  Otherwise,
  # it will return 0.  If it's 0, we pop that out, and there's no arra
+y remaining.
  
  unless (defined @{$sets{$fasta_id}}) {
    delete $sets{$fasta_id};
    next WEEDFAST;
    }

  if (@{$sets{$fasta_id}}) {
    pop @{$sets{$fasta_id}} unless scalar %{$sets{$fasta_id}[$#{$sets{
+$fasta_id}}]};
    }

# this will reset the @fastarray array, since we have possibly removed
+ fasta_id keys from
# our hash of matches.

  if (@{$sets{$fasta_id}}) {
    push(@newfast,$fasta_id);
    }
  # similarly, we want to get rid of empty hash elements.  
  else {
    delete $sets{$fasta_id};
    }
  
  #closes for fastarray
  }
 
# as @fastarray is global, this will have global effect, and we do not
+ need to return it.

 @fastarray=@newfast;
 return %sets;
 #closes subroutine
 }
[download]

Thanks everybody
Matt

In reply to Help tightening up a subroutine please by mdunnbass

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.