Re: counting the number of 16384 pattern matches in a large DNA sequence (100x faster?)

Turn the problem on its head and try it this way:

sub gen;
sub gen{
    return @_[1..$#_] if $_[0] == 1;
    map{
        my $p=$_;
        map{ $p . $_ } gen( $_[0]-1, @_[1..$#_] )
    } @_[1..$#_]
}

my %seqs = ...;
my @patterns = gen( 7, qw[A C G T] );

my %counts;

for my $seq ( values %seqs ) {
    ++$counts{ substr $seq, $_, 7 } for 0 .. length( $seq )-7;
}

print "$_ ::= $counts{ $_ }" for @patterns;
[download]

In my experiments on a 49 million base pairs sequence:

[ 0:15:31.00] C:\test\humanGenome>..\junk999 chr21.fa
16384 patterns.
49092500 base pairs
Using custom indexing found 35106546 matches; took 34.536852 seconds
Using custom index2 found 35106546 matches; took 31.354438 seconds
Simple search found 35106546 matches; took 2970.517883 seconds
[download]

it was close to 100 times faster than your current method. YMMV.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Comment on Re: counting the number of 16384 pattern matches in a large DNA sequence (100x faster?) Select or Download Code

Replies are listed 'Best First'.
Re^2: counting the number of 16384 pattern matches in a large DNA sequence (100x faster?) by anonym (Acolyte) on Jun 15, 2012 at 13:59 UTC
Hey Thanks. Actually when I tried running this, it did not print anything and I did not quite understand what @_1..$#_ does, the subroutine part.Thanks	[reply]
Re^3: counting the number of 16384 pattern matches in a large DNA sequence (100x faster?) by BrowserUk (Patriarch) on Jun 15, 2012 at 15:26 UTC
Did you replace the `...` in `my %seqs = ...;` [download] With your code that loads the hash with your sequences? With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. The start of some sanity?	[reply] [d/l] [select]
Re^3: counting the number of 16384 pattern matches in a large DNA sequence (100x faster?) by BrowserUk (Patriarch) on Jun 15, 2012 at 16:55 UTC
Here is a standalone demonstration that the code I posted works just fine: #! perl -slw use strict; use Time::HiRes qw[ time ]; sub gen{ return @_[1..$#_] if $_[0]==1; map{ my $p=$_; map{ $p . $_ } gen($_[0]-1, @_[1..$#_] ) } @_[1..$#_] } our $N //= 7; my %seqs = map { if( length() ) { my( $id, @seq ) = split "\n", $_; $id => join '', @seq; } else { () } } split '>', do{ local $/; uc( <DATA> ) }; my $start = time; my %counts; for my $seq ( values %seqs ) { ++$counts{ substr $seq, $_, $N } for 0 .. length( $seq ) -$N; } print "Took ", time - $start; print "$_ ::= ", $counts{ $_ } // 0 for gen( $N, qw[A C G T] ); __DATA__ > DNA1 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCAT TTGGTATTTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTG GAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATT CTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACCTACTA > DNA2 AAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTGAAT GTCTGCACAGCCGCTTTCCACACAGACATCATAACAAAAAATTTCCACCA AACCCCCCCCTCCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGC CAAACCCCAAAAACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAAT TTTATCTTTAGGCGGTATGCACTTTTAACAGTCACCCCCCAACTAACACA > DNA3 TTATTTTCCCCTCCCACTCCCATACTACTAATCTCATCAATACAACCCCC GCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCATACCCCGAAC CAACCAAACCCCAAAGACACCCCCCACAGTTTATGTAGCTTACCTCCTCA AAGCAATACACTGAAAATGTTTAGACGGGCTCACATCACCCCATAAACAA ATAGGTTTGGTCCTAGCCTTTCTATTAGCTCTTAGTAAGATTACACATGC AAGCATCCCCGTTCCAGTGAGTTCACCCTCTAAATCACCACGATCAAAAG > DNA4 AGCATTACTTATATGATATGTCTCCATACCCATTACAATCTCCAGCATTC CCCCTCAAACCTAAGAAATATGTCTGATAAAAGAGTTACTTTGATAGAGT AAATAATAGGAGCTTAAACCCCCTTATTTctaggactatgagaatcgaac ccatccctgagaatccaaaattctccgtgccacctatcacaccccatcct aAAGTAAGGTCAGCTAAATAAGCTATCGGGCCCATACCCCGAAAATGTTG GTTATACCCTTCCCGTACTAATTAATCCCCTGGCCCAACCCGTCATCTAC [download] And some output: `C:\test>976237-2 -N=2 Took 0.000387907028198242 AA ::= 97 AC ::= 86 AG ::= 49 AT ::= 84 CA ::= 99 CC ::= 130 CG ::= 26 CT ::= 71 GA ::= 35 GC ::= 43 GG ::= 27 GT ::= 36 TA ::= 85 TC ::= 68 TG ::= 39 TT ::= 71` [download] With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. The start of some sanity?	[reply] [d/l] [select]
Re^3: counting the number of 16384 pattern matches in a large DNA sequence (100x faster?) by BrowserUk (Patriarch) on Jun 15, 2012 at 14:42 UTC
Oh sorry. It must be broken.	[reply]