In reply to Pattern match, speed problem

A quick check estimates that searching a 30 MB string for 1 million 10-char probes using index would take around 1 hour 40 minutes (at ~6,000 seconds total, that works out to roughly 6 ms per full-string scan per probe).

Instead of searching the whole string once per probe, it is far quicker to build a hash of the probes and then scan the string once, picking off 10-char substrings and testing whether each exists in the hash:

    #! perl -slw
    use strict;

    use constant {
        CHROM  => 'chromosome.txt',
        PROBES => 'probes.txt',
    };

    # Slurp the chromosome file in one read and normalise its case.
    my $chrom;
    open C, '<', CHROM or die CHROM . ": $!";
    sysread C, $chrom, -s( CHROM ) or die $!;
    close C;
    $chrom = uc $chrom;

    # Load the probes as hash keys; the values are irrelevant.
    my %probes;
    open P, '<', PROBES or die $!;
    chomp, undef $probes{ $_ } while <P>;
    close P;

    warn time;

    # Single pass: test the 10-char substring at every offset.
    my $p;
    for ( 0 .. length( $chrom ) - 10 ) {
        exists $probes{ $p = substr( $chrom, $_, 10 ) }
            and print "$p : $_";
    }

    warn time;

On randomly generated data this method took just 71 seconds to test 30 MB against 1e6 10-char probes. It required ~90 MB of RAM to build the hash.

If your probes are of different lengths, you would need multiple substr calls, or a loop from the minimum to the maximum length, but it should still show a marked improvement in performance. Update: It does. With 1e6 probes of 8 through 12 chars each, it takes almost exactly 5 times as long (one extra pass per extra length), for a total time of 335 seconds.
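The variable-length variant can be sketched as a small change to the scan loop above. This is a sketch only, assuming %probes and $chrom have been loaded as in the script above and that the probe lengths run from 8 to 12; the $min/$max names are my own:

    # Sketch: %probes and $chrom loaded as before, but probes
    # may be anywhere from $min to $max characters long.
    my ( $min, $max ) = ( 8, 12 );

    for my $pos ( 0 .. length( $chrom ) - $min ) {
        # Try every allowed probe length at this offset.
        for my $len ( $min .. $max ) {
            last if $pos + $len > length $chrom;
            my $p = substr( $chrom, $pos, $len );
            print "$p : $pos" if exists $probes{ $p };
        }
    }

Since each offset now requires one hash lookup per candidate length, the runtime scales with the number of distinct lengths, which matches the roughly 5x figure observed for lengths 8 through 12.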


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."