Here's my attempt at refactoring your WEED() sub. As I don't have data, this has been done blind.
Ie. Without the tests you would normally use to ensure that it continues to function correctly
and that changes aimed at improving performance actually achieve that goal.
So, I'll step through the changes I've made, so that you can apply and test them as you go. It should also give you some ideas for refactoring the other subs that need it.
The first thing I did was to do some trivial reformatting of the code to make the indentation consistent, add some horizontal whitespace and reduce the vertical whitespace.
These changes allow me to get a clearer overview of the structure of the sub and (I find) make it easier to read. Along the way, I've also remove the comments as I find them a distraction.
Obviously, you don't need to make these changes,
One anomaly that jumped out at me as I manually reformatted the code I got from your scratchpad was this line:
WEEDLOW: for my $low ( @{ $matchesfasta_id{ $site } } ) {
As there is no other reference to a variable named $matchesfasta_id in the file, and there is a line a couple above this line:
last unless @{$matches{$fasta_id}{$site}};
I've assumed that this is a typo/transcription error and should read:
WEEDLOW: for my $low ( @{ $matches{ $fasta_id }{ $site }
+}) {
I've posted my starting point after the reformatting below:
The first things I did were simple clean ups of various things in the code. Starting from the top:
- my( $site, $fasta, $sitekey) = '';
The first and last of these declared variables are used as loop iterators, so it is much better to declare them in-line with the loops.
The middle one is not used and can be done away with.
- my( $setscounter, $lowerlimit, $upperlimit, $i, $set, $yes ) = 0;
The first three of these are used only within the loops and are never referred to outside of the scope at which they first appear, so they again can be declared inline with their first usage.
The last three are not used anywhere in the sub and can be dropped.
- my %sets = (); Lexical variables are initialised when they come into being, so the initialisation is unnecessary
and simple clutters the code.
Now I started looking at the function of the code and the first thing I noticed is that you sort the keys of %matches
in the for loop at A below, and again each time around the nested inner loop at B.
for my $site ( sort { $a <=> $b } keys %{ $matches{ $fasta_id } } ) {
+ ## A
last unless @{ $matches{ $fasta_id }{ $site } };
WEEDLOW: for my $low ( @{ $matches{ $fasta_id }{ $site } } ) {
my $lowerlimit = $low + 0;
my $upperlimit = $span + $lowerlimit;
for my $sitekey ( sort { $a <=> $b } keys %{ $matches{ $fasta_
+id } } ) { ## B
Since you are not modifying %matches between these two points (or at all), then it means that you are sorting and re-sorting these keys many times.
It's not easy to tell in the absence of test data how big this hash is, or how many time these loops iterate,
but given the sizes of the data files mentioned in the OP, and the timing information you've supplied above, it appears that they may be substantial.
I could well see this needless re-sorting being the source of a substantial amount of the runtime of the program.
Eliminating this duplication is easy and should produce a substantial performance benefit. Replacing the above lines with these should do the trick:
my @sortedKeys = sort { $a <=> $b } keys %{ $matches{ $fasta_i
+d } };
for my $site ( @sortedKeys ) {
last unless @{ $matches{ $fasta_id }{ $site } };
WEEDLOW: for my $low ( @{ $matchesfasta_id{ $site } } ) {
my $lowerlimit = $low + 0;
my $upperlimit = $span + $lowerlimit;
for my $sitekey ( @sortedKeys ) {
The next thing I looked at was the section of code where you 'band pass' filter the matches into a temporary array.
You then conditionally copy that temporary array into your results array, or discard the entire super set if nothing made it through the filter:
for my $hit ( @{ $matches{ $fasta_id }{ $sitekey } } ) {
next unless $hit >= $lowerlimit;
last unless $hit <= $upperlimit;
push @arrayA, $hit + 0;
}
if( @arrayA ) {
@{ $sets{ $fasta_id }[ $setscounter ]{ $sitekey } } = @arrayA;
}
else {
%{ $sets{ $fasta_id }[ $setscounter ] } = ();
}
Whilst there is nothing inherently wrong with this, Perl has a general purpose array filtering mechanism,
namely grep specifically for this purpose and it is usually considerably quicker than an explicit loop:
@{ $sets{ $fasta_id }[ $setscounter ]{ $sitekey } } = grep {
$_ >= $lowerlimit and $_ <= $upperlimit
} @{ $matches{ $fasta_id }{ $sitekey } }
unless( @{ $sets{ $fasta_id }[ $setscounter ]{ $sitekey } } ) {
undef $sets{ $fasta_id }[ $setscounter ] };
next WEEDLOW;
}
There is a possible caveat with this change!
Your explicit loop does allow you to short circuit the loop at the high pass limit which isn't possible with grep.
However, as shown above, using grep does allow you to avoid the creation of and later copying of the temporary array.
This combined with greps inherent better performance may outweigh the benefit of that short circuiting. Or it may not.
The only way to tell will be to make the change and benchmark.
If the size of the array being filtered is sufficiently large, and the pass band sufficiently narrow and low down in the range
that the short circuit is beneficial, then it would be better to use a binary search algorithm to discover the low and high range limits
and then copy the values across to the results set using an array slice.
I haven't coded that as I have no way to make the test.
Finally (for now), you use this construct:
%{ $sets{ $fasta_id }[ $setscounter ] } = ();
To empty arrays in several places. Whilst there is nothing wrong with this, I think it is clearer, and may be slightly quicker to use
undef $sets{ $fasta_id }[ $setscounter ] };
and I've made those changes also.
The final code is the second of the two code blocks below.
There are various other things that look suspect.
|