Re^7: Is it possible to make reference to substrings and sort them?

I have to find all ready that match with no mismatch

If you're only looking for exact matches, something like this does the job in ~ 7 seconds (including building the index):

#! perl -slw
use strict;
use Time::HiRes qw[ time ];

our $MIN        //= 5;
our $SHORT      //= 20;
our $NSHORTS    //= 10e3;
our $BIG        //= 5e6;
our $S          //= 0;

$S and srand( $S );
sub rndDna{ my $dna =''; $dna .= ( qw[A C G T] )[ rand 4 ] for 1 .. $_
+[0]; return $dna; }

my $bigDna = rndDna( $BIG );
my @shorts = map rndDna( $SHORT ), 1 .. $NSHORTS;

my $start = time;
my %idx; push @{ $idx{ substr $bigDna, $_, $MIN } }, $_ for 0 .. lengt
+h( $bigDna ) - $MIN;
printf STDERR "Indexing took %.9f secs\n", time() -$start ;

$start = time;
my $found = 0;
for my $short ( @shorts ) {
    for my $pos ( @{ $idx{ my $pre = substr $short, 0, $MIN } } ) {
        if( substr( $bigDna, $pos, $SHORT ) eq $short ) {
            print "$short found at $pos";
            ++$found;
        }
    }
}

print STDERR "Found: ", $found;
printf STDERR "Lookup of $NSHORTS $SHORT-nt short reads against a $BIG
+-nt sequence took %.9f secs\n", time() -$start ;

__END__
C:\test>1120869-2 -BIG=5e6 -MIN=5 -SHORT=14 -NSHORTS=10000 -S=1 | wc -
+l
Indexing took 3.365217924 secs
Found: 182
Lookup of 10000 14-nt short reads against a 5e6-nt sequence took 25.56
+8705082 secs
    182

C:\test>1120869-2 -BIG=5e6 -MIN=6 -SHORT=14 -NSHORTS=10000 -S=1 | wc -
+l
Indexing took 3.351562023 secs
Found: 182
Lookup of 10000 14-nt short reads against a 5e6-nt sequence took 6.666
+512966 secs
    182

C:\test>1120869-2 -BIG=5e6 -MIN=7 -SHORT=14 -NSHORTS=10000 -S=1 | wc -
+l
Indexing took 4.609364986 secs
Found: 182
Lookup of 10000 14-nt short reads against a 5e6-nt sequence took 1.699
+807882 secs
    182

C:\test>1120869-2 -BIG=5e6 -MIN=7 -SHORT=14 -NSHORTS=10000 -S=1 | wc -
+l
Indexing took 4.519516945 secs
Found: 182
Lookup of 10000 14-nt short reads against a 5e6-nt sequence took 1.682
+117939 secs
    182

C:\test>1120869-2 -BIG=5e6 -MIN=8 -SHORT=14 -NSHORTS=10000 -S=1 | wc -
+l
Indexing took 6.175763845 secs
Found: 182
Lookup of 10000 14-nt short reads against a 5e6-nt sequence took 0.424
+978018 secs
    182

C:\test>1120869-2 -BIG=5e6 -MIN=10 -SHORT=14 -NSHORTS=10000 -S=1 | wc 
+-l
Indexing took 9.214743853 secs
Found: 182
Lookup of 10000 14-nt short reads against a 5e6-nt sequence took 0.058
+255196 secs
    182
[download]

Note: I'm using 14nt short reads because any longer and none are found in my randomly generated test sequence.

You can reduce the lookup time by increasing number of characters you index; with the trade off that the indexing take longer and uses more memory:

1 x 5e6nt sequence; 10e3 x 14nt short reads; 
Index prefix:   indexing(secs):   lookup(secs): found:   memory(MB):
 5              3.37              25.57         182      200
 6              3.38               6.66         182      210
 7              4.06               1.69         182      230
 8              6.17               0.42         182      244
10              9.21               0.05         182      450
[download]

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this

In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

Comment on Re^7: Is it possible to make reference to substrings and sort them? Select or Download Code