Re: pattern match, speed problem
by BrowserUk (Patriarch) on Feb 20, 2008 at 08:48 UTC
A quick check estimates that it will take around 1 hour 40 minutes to search a 30MB string for 1 million 10-char strings using index.
Instead of searching the whole string for each of the probes, it is far quicker to build a hash of the probes and then scan the string once, picking off 10-char substrings and testing to see if it exists in the hash:
#! perl -slw
use strict;

use constant {
    CHROM  => 'chromosome.txt',
    PROBES => 'probes.txt',
};

## Slurp the whole chromosome in one read
my $chrom;
open C, '<', CHROM or die CHROM . ": $!";
sysread C, $chrom, -s( CHROM ) or die $!;
close C;
$chrom = uc $chrom;    ## the match is case insensitive

## Build a hash keyed by the probe strings
my %probes;
open P, '<', PROBES or die PROBES . ": $!";
chomp, undef $probes{ $_ } while <P>;
close P;

warn time;
my $p;
for ( 0 .. length( $chrom ) - 10 ) {
    exists $probes{ $p = substr( $chrom, $_, 10 ) } and print "$p : $_";
}
warn time;
warn time;
On randomly generated data this method took just 71 seconds to test 30MB against 1e6 10-char probes. It required ~90MB of RAM to build the hash.
If your probes are different lengths, then you would need multiple substrs, or a loop from min to max length, but it should still show a marked improvement in performance. Update: It does. With 1e6 probes of 8 through 12 chars each, it takes almost exactly 5 times as long, for a total of 335 seconds.
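For the curious, the variable-length variant might look something like this (a toy sketch with made-up probe and chromosome data, not the benchmarked code):

```perl
#!/usr/bin/perl
# Sketch of the variable-length variant: probe lengths 8..12 instead of a
# fixed 10. %probes and $chrom are toy stand-ins for the real files.
use strict;
use warnings;

my %probes = map { $_ => undef } qw( ACGTACGT ACGTACGTAC );  # 8- and 10-char probes
my $chrom  = 'TTACGTACGTACTT';
my ( $min, $max ) = ( 8, 12 );

my @hits;
for my $pos ( 0 .. length( $chrom ) - $min ) {
    for my $len ( $min .. $max ) {
        last if $pos + $len > length $chrom;
        my $p = substr( $chrom, $pos, $len );
        push @hits, "$p : $pos" if exists $probes{ $p };
    }
}
print "$_\n" for @hits;    # ACGTACGT : 2 / ACGTACGTAC : 2
```

The inner loop is what costs the extra factor of ~5: one substr and one hash lookup per candidate length at every position.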
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: pattern match, speed problem
by hipowls (Curate) on Feb 20, 2008 at 06:42 UTC
my $position = index $chrom, $_;
if ( $position != -1 ) {
# we matched
}
And the large string needs to be uppercased, since the pattern match was case insensitive.
perl -pe'tr/acgt/ACGT/' -i big_string_file
or
perl -pe'tr/ACGT/acgt/' -i probe_string_file
Re: pattern match, speed problem
by moritz (Cardinal) on Feb 20, 2008 at 10:59 UTC
Instead of iterating over the probes, build a single regex:
my $re = join '|', @probes;
$re = qr{(?:$re)};    # just in case you want to use it somewhere else,
                      # and now it's compiled
And make sure that you use perl 5.10, which has a special optimization in the regex engine for alternations of fixed strings.
Note that this won't find overlapping matches; if you need them, you have to fiddle with pos (and perhaps \G in the regex) manually.
It will also change the order in which the matches are returned, but if you badly need the original order, there are ways around that too.
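The pos fiddling might look something like this (a sketch with a toy alternation): wrapping the regex in a zero-width lookahead means the engine consumes nothing on a match, so it can restart inside a previous hit.

```perl
#!/usr/bin/perl
# Sketch: get overlapping matches by putting the alternation inside a
# zero-width lookahead; the capture grabs the text, pos() gives the start.
use strict;
use warnings;

my $re    = qr/(?:ACAC|CACA)/;    # toy alternation
my $chrom = 'ACACAC';

my @hits;
while ( $chrom =~ /(?=($re))/g ) {
    push @hits, [ $1, pos( $chrom ) ];   # for a zero-width match, pos() is the match start
}
printf "%s at %d\n", @$_ for @hits;      # ACAC at 0, CACA at 1, ACAC at 2
```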
It's a seductive idea, but it doesn't seem to scale. With 1,000 probes and 30MB it takes under 2 seconds. With 10,000 probes it takes 22 seconds; 11,000, 23 secs; 12,000, 24 secs. Everything looks good. But with 13,000 I left it running for nearly 10 minutes and it still had not completed, and the output seemed to have slowed to a trickle.
I also tried 15,000 and 20,000 in case I had hit some odd pathological case, but neither ever seemed to complete.
Did you try it with perl 5.10 or newer? (Sorry, I don't have one available here to test it.)
Anyway, you could join just 1000 probes into one regex, and repeat the process until all probes have been tested. That should still be much faster than single runs.
Update: I thought a bit more about it, and got the following idea: if you have many probes and use a fixed number of probes per regex, you can sort the probes before building the buckets. That way probes sharing prefixes land in the same regex, playing to the strengths of the re engine and its trie optimization.
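A sketch of what I mean (bucket size and probe data are made up; in practice the buckets would hold ~1000 probes each):

```perl
#!/usr/bin/perl
# Sketch of the bucketing idea: sort the probes, split them into fixed-size
# buckets, build one alternation per bucket, and scan the string once per
# bucket.
use strict;
use warnings;

my @probes = qw( ACGTACGTAC TTTTACGTAC ACGTTTTTAC GGGGACGTAC );
my $chrom  = 'AAACGTACGTACAAGGGGACGTACAA';
my $bucket_size = 2;    # more like 1000 in practice

my @hits;
my @sorted = sort @probes;    # shared prefixes land in the same bucket
while ( my @bucket = splice @sorted, 0, $bucket_size ) {
    my $alt = join '|', map quotemeta, @bucket;
    my $re  = qr/(?:$alt)/;
    while ( $chrom =~ /$re/g ) {
        push @hits, sprintf '%s : %d', $&, pos( $chrom ) - length $&;
    }
}
print "$_\n" for @hits;    # ACGTACGTAC : 2 / GGGGACGTAC : 14
```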
Re: pattern match, speed problem
by stiller (Friar) on Feb 20, 2008 at 08:39 UTC
index has been suggested. Also, if case is not interesting, making both input files all upper or all lower case just once is much faster than ignoring case on every attempt.
This problem also lends itself nicely to concurrency, if you have the hardware for it (e.g. several machines, processors and/or cores): divide the probes into a number of partitions, and set off a process searching for each partition of the problem.
Also, it seems likely that somebody has already optimized this kind of search. For example, if the only letters are C, G, A and T, then your full 8-bit-per-character string takes a lot more space and time to search than if you convert it to 2 bits per base (e.g. 00 = C, 01 = G, 10 = A, 11 = T). But this is not my area... hth
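If anyone wants to experiment, here is a toy sketch of the 2-bit packing (the encoding table is my own arbitrary choice):

```perl
#!/usr/bin/perl
# Sketch of the 2-bit idea: pack C/G/A/T into 2 bits per base, so the
# string takes a quarter of the space.
use strict;
use warnings;

my %code = ( C => '00', G => '01', A => '10', T => '11' );

sub pack_dna {
    my ( $seq ) = @_;
    my $bits = join '', map $code{ $_ }, split //, uc $seq;
    return pack 'B*', $bits;    # 4 bases per byte
}

my $packed = pack_dna( 'CGATCGAT' );
print length( $packed ), " bytes for 8 bases\n";    # 2 bytes instead of 8
```

Searching the packed form is trickier, because a probe can start at any of four bit offsets within a byte; you would need four shifted copies of the packed probe, or a bitwise scan.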
Re: pattern match, speed problem
by ysth (Canon) on Feb 20, 2008 at 07:41 UTC
Your /g flag is making each match start looking only after where the previous one left off. Is this intentional?
Re: pattern match, speed problem
by bobf (Monsignor) on Feb 21, 2008 at 05:09 UTC
index has already been mentioned, but I wanted to comment on this:
Only one hit is possible for each 'probe' string
It is possible that the probes have already been screened for uniqueness, but based on a frequency of 0.25 for each base (on average and subject to inter-species variation, of course) a 10-mer would occur by chance once every 1,048,576 bases (4**10). Your chromosomes are 30 million bases long, so in the absence of other information I'd expect each probe to match quite a few times (roughly 30e6/4**10, or about 29 hits) on each chromosome.
That said, note that the third parameter of index sets the start position for the search. To find all matches you'll have to call index iteratively, use m//g, or (the approach I prefer, BrowserUk++) create an index of the chromosomes before searching for your 1 million probes.
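The iterative index approach might look like this (a sketch with toy data; advance by the probe length instead of 1 if you don't care about overlaps):

```perl
#!/usr/bin/perl
# Sketch: find every occurrence of one probe with index, restarting the
# search just past the previous hit so overlapping matches are found too.
use strict;
use warnings;

my $chrom = 'ACACACAC';
my $probe = 'ACAC';

my @positions;
my $from = 0;
while ( ( my $hit = index $chrom, $probe, $from ) != -1 ) {
    push @positions, $hit;
    $from = $hit + 1;    # use $hit + length $probe to skip overlaps
}
print "@positions\n";    # 0 2 4
```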