You could try profiling with this. It does a bastardized binary chop on both ends to locate the passband and then steps back to locate the limits:
sub bandpass {
    my( $aref, $loValue, $hiValue ) = @_;
    return if $loValue > $aref->[-1] or $hiValue < $aref->[0];
    my( $lo, $hi ) = ( 1, $#{ $aref } );
    $lo += $lo while $aref->[$lo] < $loValue;
    --$lo while $lo and $aref->[$lo-1] >= $loValue;
    $hi >>= 1 while $aref->[$hi] > $hiValue;
    ++$hi while $hi < $#$aref and $aref->[$hi+1] <= $hiValue;
    return @{ $aref }[ $lo .. $hi ];
}
...
my $aref = $matches{ $fasta_id }{ $sitekey };
$sets{ $fasta_id }[ $setscounter ]{ $sitekey } =
    [ bandpass( $aref, $lowerlimit, $upperlimit ) ];
The bastardization is designed to handle non-unique and/or missing values in the set and still locate the appropriate boundaries. If you know your values will always be unique, that could simplify things a little. If you knew that the limits ($lowerlimit & $upperlimit) would always be contained within the set, that would simplify things considerably, but from what you've said elsewhere that is not the case.
Update: Further testing with more realistic test data showed the above to be both inefficient and flawed.
The problem with optimising this is the range of possible variations. Optimising for the general case is likely to do very badly for particular cases. Your original solution works well for narrow bands that are low in the range:
min---------loPass--hiPass------------------------------------max
But will perform badly for either of the following, where grep will perform quite well:
min-------------------------------------------loPass--hiPass--max
min--loPass-------------------------------------------hiPass--max
For the latter of these two, scanning in from either end will be optimal, as with my crude linear solution. But from what you've said about the datasets, it seems that your pass bands are quite narrow, yet can be located anywhere within the entire range. And that is the hardest scenario to optimise for.
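For reference, the grep scan mentioned above can be as simple as the following sketch (the name bandpass_grep is mine, for illustration). It is O(n), but hard to beat when the pass band covers most of the range:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Linear scan: keep every value that falls inside the pass band.
sub bandpass_grep {
    my( $aref, $loValue, $hiValue ) = @_;
    return grep { $_ >= $loValue and $_ <= $hiValue } @{ $aref };
}

my @data = ( 1, 3, 5, 7, 9, 11 );
print join( ' ', bandpass_grep( \@data, 4, 10 ) ), "\n";   # 5 7 9
```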
Here is another version that I believe to be both correct and tailored towards being efficient for the type of data you've indicated you are working with. The revelation was that instead of using two binary chops to locate the two ends of the pass band--which is quite expensive, as your limits may not appear in the dataset--I could use a single binary chop to locate one value that lies within the pass band, and then use two linear searches outward from that value to locate the lower and upper limits. If your pass bands are narrow, these linear searches will have little to do. They also neatly handle the possibility of duplicate values, which complicates binary chops.
sub bandpass {
    my( $aref, $loValue, $hiValue ) = @_;
    return if $loValue > $aref->[-1] or $hiValue < $aref->[0];
    my( $lo, $hi ) = ( 0, $#{ $aref } );
    while( $lo < $hi ) {
        my $mid = int( ( $lo + $hi ) / 2 );
        if( $aref->[ $mid ] >= $loValue and $aref->[ $mid ] <= $hiValue ) {
            $lo = $hi = $mid;
            last;
        }
        elsif( $aref->[ $mid ] < $loValue ) {
            $lo = $mid + 1;
        }
        elsif( $aref->[ $mid ] > $hiValue ) {
            $hi = $mid - 1;
        }
    }
    return if $loValue > $aref->[ $hi ] or $hiValue < $aref->[ $lo ];
    --$lo while $lo and $aref->[ $lo - 1 ] >= $loValue;
    ++$hi while $hi < $#{ $aref } and $aref->[ $hi + 1 ] <= $hiValue;
    return @{ $aref }[ $lo .. $hi ];
}
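To illustrate, here is a small self-contained sketch exercising that second version with made-up data: sorted values containing duplicates, and limits that do not themselves appear in the set (the @data values are my own, purely for demonstration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The single-chop-plus-linear-scans version from above.
sub bandpass {
    my( $aref, $loValue, $hiValue ) = @_;
    return if $loValue > $aref->[-1] or $hiValue < $aref->[0];
    my( $lo, $hi ) = ( 0, $#{ $aref } );
    while( $lo < $hi ) {
        my $mid = int( ( $lo + $hi ) / 2 );
        if( $aref->[ $mid ] >= $loValue and $aref->[ $mid ] <= $hiValue ) {
            $lo = $hi = $mid;   # found a value inside the band
            last;
        }
        elsif( $aref->[ $mid ] < $loValue ) { $lo = $mid + 1 }
        else                                { $hi = $mid - 1 }
    }
    return if $loValue > $aref->[ $hi ] or $hiValue < $aref->[ $lo ];
    --$lo while $lo and $aref->[ $lo - 1 ] >= $loValue;
    ++$hi while $hi < $#{ $aref } and $aref->[ $hi + 1 ] <= $hiValue;
    return @{ $aref }[ $lo .. $hi ];
}

# Duplicates present; band limits 4 & 8 are not in the set.
my @data = ( 1, 3, 5, 5, 7, 9, 11 );
print join( ' ', bandpass( \@data, 4, 8 ) ), "\n";     # 5 5 7

# Band entirely above the range: returns the empty list.
print scalar( () = bandpass( \@data, 12, 20 ) ), "\n"; # 0
```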
Again, no guarantees. It's very hard to do this kind of optimisation without real datasets to benchmark against.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.