in reply to counting the number of 16384 pattern matches in a large DNA sequence
Turn the problem on its head and try it this way:
    sub gen;
    # Recursively build every string of length $_[0] over the alphabet @_[1..$#_]
    sub gen{
        return @_[1..$#_] if $_[0] == 1;
        map{
            my $p=$_;
            map{ $p . $_ } gen( $_[0]-1, @_[1..$#_] )
        } @_[1..$#_]
    }

    my %seqs = ...;
    my @patterns = gen( 7, qw[A C G T] );

    # One pass per sequence: count every 7-base window as it slides along
    my %counts;
    for my $seq ( values %seqs ) {
        ++$counts{ substr $seq, $_, 7 } for 0 .. length( $seq )-7;
    }

    print "$_ ::= $counts{ $_ }\n" for @patterns;
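If you want to convince yourself that the single pass counts the same thing as searching for each pattern in turn, here is a minimal sketch on a toy sequence. The helper names (kmers, count_windows, count_by_search), the toy sequence and the window length of 2 are made up for illustration and are not part of the code above. Note that the per-pattern check uses a lookahead so that overlapping occurrences are counted, which is what a sliding window does; a plain /$pat/g search would skip overlaps and could report fewer matches.

```perl
use strict;
use warnings;

# Hypothetical helper: every string of length $k over @alphabet.
sub kmers {
    my( $k, @alphabet ) = @_;
    my @out = ( '' );
    for my $i ( 1 .. $k ) {
        @out = map { my $p = $_; map { $p . $_ } @alphabet } @out;
    }
    return @out;
}

# Hypothetical helper: one pass over $seq, counting every window of length $k.
sub count_windows {
    my( $seq, $k ) = @_;
    my %counts;
    ++$counts{ substr( $seq, $_, $k ) } for 0 .. length( $seq ) - $k;
    return \%counts;
}

# Per-pattern search, for comparison only; the lookahead lets matches overlap.
sub count_by_search {
    my( $seq, $pat ) = @_;
    my $n = () = $seq =~ /(?=\Q$pat\E)/g;
    return $n;
}

my $seq    = 'ACGTACGAAA';    # toy data, not a real sequence
my $k      = 2;
my $counts = count_windows( $seq, $k );

for my $pat ( kmers( $k, qw[A C G T] ) ) {
    my $n = $counts->{ $pat } // 0;    # patterns never seen count as 0
    die "mismatch for $pat" unless $n == count_by_search( $seq, $pat );
    print "$pat ::= $n\n" if $n;
}
```

The point of the design is that the work depends on the length of the sequence rather than on the number of patterns: each position is hashed once, whereas the per-pattern approach rescans the whole sequence for every one of the patterns.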
In my experiments on a sequence of 49 million base pairs, it was close to 100 times faster than your current method. YMMV.

    [ 0:15:31.00] C:\test\humanGenome>..\junk999 chr21.fa
    16384 patterns. 49092500 base pairs
    Using custom indexing found 35106546 matches; took 34.536852 seconds
    Using custom index2 found 35106546 matches; took 31.354438 seconds
    Simple search found 35106546 matches; took 2970.517883 seconds