In reply to Pattern match, speed problem

A quick check estimates that searching a 30 MB string for 1 million 10-char probes using index would take around 1 hour 40 minutes (at ~6,000 seconds total, that works out to roughly 6 ms per full-string scan per probe).

Instead of searching the whole string once per probe, it is far quicker to build a hash of the probes and then scan the string once, picking off 10-char substrings and testing whether each exists in the hash:

    #! perl -slw
    use strict;

    use constant {
        CHROM  => 'chromosome.txt',
        PROBES => 'probes.txt',
    };

    # Slurp the chromosome file in one read and normalise its case.
    my $chrom;
    open C, '<', CHROM or die CHROM . ": $!";
    sysread C, $chrom, -s( CHROM ) or die $!;
    close C;
    $chrom = uc $chrom;

    # Load the probes as hash keys; the values are irrelevant.
    my %probes;
    open P, '<', PROBES or die $!;
    chomp, undef $probes{ $_ } while <P>;
    close P;

    warn time;

    # Single pass: test the 10-char substring at every offset.
    my $p;
    for ( 0 .. length( $chrom ) - 10 ) {
        exists $probes{ $p = substr( $chrom, $_, 10 ) }
            and print "$p : $_";
    }

    warn time;

On randomly generated data this method took just 71 seconds to test 30 MB against 1e6 10-char probes. It required ~90 MB of RAM to build the hash.

If your probes are of different lengths, you would need multiple substr calls, or a loop from the minimum to the maximum length, but it should still show a marked improvement in performance. Update: It does. With 1e6 probes of 8 through 12 chars each, it takes almost exactly 5 times as long (one extra pass per extra length), for a total time of 335 seconds.
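The variable-length variant can be sketched as a small change to the scan loop above. This is a sketch only, assuming %probes and $chrom have been loaded as in the script above and that the probe lengths run from 8 to 12; the $min/$max names are my own:

    # Sketch: %probes and $chrom loaded as before, but probes
    # may be anywhere from $min to $max characters long.
    my ( $min, $max ) = ( 8, 12 );

    for my $pos ( 0 .. length( $chrom ) - $min ) {
        # Try every allowed probe length at this offset.
        for my $len ( $min .. $max ) {
            last if $pos + $len > length $chrom;
            my $p = substr( $chrom, $pos, $len );
            print "$p : $pos" if exists $probes{ $p };
        }
    }

Since each offset now requires one hash lookup per candidate length, the runtime scales with the number of distinct lengths, which matches the roughly 5x figure observed for lengths 8 through 12.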


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."