comment on

You could try profiling with this. It does a bastardized binary chop on both ends to locate the passband and then steps back to locate the limits:

sub bandpass {
    my( $aref, $loValue, $hiValue ) = @_;
    return if $loValue > $aref->[-1] or $hiValue < $aref->[0];
    my( $lo, $hi ) = ( 1, $#{ $aref } );
    $lo += $lo while                   $aref->[$lo]   <  $loValue;
    --$lo      while $lo           and $aref->[$lo-1] >= $loValue;
    $hi >>= 1  while                   $aref->[$hi]   >  $hiValue;
    ++$hi      while $hi < $#$aref and $aref->[$hi+1] <= $hiValue;
    return @{ $aref }[ $lo .. $hi ];
}
...
my $aref = $matches{ $fasta_id }{ $sitekey };
$sets{ $fasta_id }[ $setscounter ]{ $sitekey } = 
    [ bandpass( $aref, $lowerlimit, $upperlimit ) ];
[download]

The bastardization is designed to handle non-unique and/missing values in the set and still locate the appropriate boundaries. If you know your values will always be unique that could simplify things a little. If you knew that the limits ($lowerlimit & $upperlimit) would always be contained within the set, that would simplify things considerably, but from what you've said elsewhere that is not the case.

Update: Further testing with more realistic test data showed the above to be both inefficient and flawed.

The problem with optimising this is the range of possible variations. Optimising for the general case is likely to do very badly for particular cases. Your original solution works well for narrow bands that are low in the range:

min---------loPass--hiPass------------------------------------max
[download]

But will perform badly for either of the following, where grep will perform quite well:

min-------------------------------------------loPass--hiPass--max
min--loPass-------------------------------------------hiPass--max
[download]

For the latter of the these two, scanning in from either end will be optimal as with my crude linear solution, but from what you've said about the datasets, it seems that your pass bands are quite narrow, but can be located anywhere within the entire range. And that is the hardest scenario to optimise for.

Here is another version that I believe to be both correct; and tailored towards being efficient for the type of data you've indicated you are working with. The revelation came to me that instead of using two binary chops to locate the two ends of the pass band--which is quite expensive as your limits may not appear in the dataset--I could use a single binary chop to locate a (single) value that lies within the passband and then use two linear searches out from that value to locate the lower and upper limits. If your pass bands are narrow, these linear searches will have little to do. They will also neatly handle the possibility of duplicate values which complicates binary chops.

sub bandpass {
    my( $aref, $loValue, $hiValue ) = @_;
    return if $loValue > $aref->[-1] or $hiValue < $aref->[0];
    my( $lo, $hi ) = ( 0, $#{ $aref } );
    while( $lo < $hi ) {
        my $mid = int( ( $lo + $hi ) / 2 );
        if( $aref->[ $mid ] >= $loValue and $aref->[ $mid ] <= $hiValu
+e ) {
            $lo = $hi = $mid;
            last;
        }
        elsif( $aref->[ $mid ] < $loValue ) {
            $lo = $mid + 1;
        }
        elsif( $aref->[ $mid ] > $hiValue ) {
            $hi = $mid - 1;
        }
    }
    return if $loValue > $aref->[ $hi ] or $hiValue < $aref->[ $lo ];
    --$lo while $lo and $aref->[ $lo - 1 ] >= $loValue;
    ++$hi while $hi < $#{ $aref } and $aref->[ $hi + 1 ] <= $hiValue;
    return @{ $aref }[ $lo .. $hi ];
}
[download]

Again, no guarentees. It's very hard to do this kind of optimisation without real datasets to benchmark against.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

In reply to Re^3: Help tightening up a subroutine please (Updated) by BrowserUk
in thread Help tightening up a subroutine please by mdunnbass

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.