Turn the problem on its head and try it this way:
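    sub gen;
    sub gen {
        # Base case: patterns of length 1 are just the alphabet itself.
        return @_[ 1 .. $#_ ] if $_[0] == 1;
        # Otherwise prefix each letter to every pattern that is one shorter.
        map {
            my $p = $_;
            map { $p . $_ } gen( $_[0] - 1, @_[ 1 .. $#_ ] )
        } @_[ 1 .. $#_ ];
    }

    my %seqs = ...;
    my @patterns = gen( 7, qw[A C G T] );

    my %counts;
    for my $seq ( values %seqs ) {
        # One pass per sequence: count every overlapping 7-base window.
        ++$counts{ substr $seq, $_, 7 } for 0 .. length( $seq ) - 7;
    }

    # Patterns that never occur have no hash entry, so report them as 0.
    print "$_ ::= ", $counts{ $_ } || 0, "\n" for @patterns;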
In my experiments on a sequence of 49 million base pairs, it was close to 100 times faster than your current method. YMMV.

    [ 0:15:31.00] C:\test\humanGenome>..\junk999 chr21.fa
    16384 patterns. 49092500 base pairs
    Using custom indexing found 35106546 matches; took 34.536852 seconds
    Using custom index2 found 35106546 matches; took 31.354438 seconds
    Simple search found 35106546 matches; took 2970.517883 seconds
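
To see why this helps, here is a minimal sketch of the same counting idea on a toy input. Instead of searching the sequence once for each of the 16384 patterns, you walk the sequence once and let the hash do the bookkeeping. The helper name count_kmers, the short test string and the window length of 2 are made up for illustration and are not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): count every
# overlapping window of length $k in $seq in a single pass.
sub count_kmers {
    my ( $seq, $k ) = @_;
    my %counts;
    # If the sequence is shorter than $k the range is empty and nothing is counted.
    ++$counts{ substr $seq, $_, $k } for 0 .. length( $seq ) - $k;
    return \%counts;
}

# Toy input, chosen only to keep the output short.
my $counts = count_kmers( 'ACGTACGA', 2 );

print "$_ ::= $counts->{ $_ }\n" for sort keys %$counts;
```

For this input the seven windows are AC, CG, GT, TA, AC, CG and GA, so AC and CG are each counted twice and GA, GT and TA once. The work grows with the length of the sequence, not with the number of patterns you are interested in.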
In reply to Re: counting the number of 16384 pattern matches in a large DNA sequence (100x faster?)
by BrowserUk
in thread counting the number of 16384 pattern matches in a large DNA sequence
by anonym