Re^2: counting the number of 16384 pattern matches in a large DNA sequence

Thanks for the reply. One of the inputs is a file of 16384 strings given in below example format:

AAAAAAA
AAAAAAT
AAAAAAG
AAAAAAC
AAAAATA
AAAAATT
AAAAATG
AAAAATC
AAAAAGA
AAAAAGT
AAAAAGG
[download]

Another input file is a multi fasta file having sequences for each chromosome like this:

>chr1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
[download]

so i have stored the sequence for each chromosome in a hash with hash key as chromosome number and concatenated sequence as value.The number of lines in the multifasta file is 61913917.Thanks

Comment on Re^2: counting the number of 16384 pattern matches in a large DNA sequence Select or Download Code

Replies are listed 'Best First'.
Re^3: counting the number of 16384 pattern matches in a large DNA sequence by BrowserUk (Patriarch) on Jun 14, 2012 at 17:40 UTC
`AAAAAAA AAAAAAT AAAAAAG AAAAAAC` [download] Are all your patterns the same length? Are they all upper case? Are you looking for exact (including case) matches only? Is it possible to obtain a copy of the patterns file? A small extract from a fasta file I have kicking around (HG:chr20): `... NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN GATCCAgaggtggaagaggaaggaagcttggaaccctatagagttgctga gtgccaggaccagatcctggccctaaacaggtggtaaggaaggagagagt gaaggaactgccaggtgacacactcccaccatggacctctgggatcctag ctttaagagatcccatcacccacatgaacgtttgaattgacagggggagc ...` [download] index is usually faster for matching constant strings, but if you you want case independent matches, you would need to uc the sequences before searching (and studying). With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. The start of some sanity?	[reply] [d/l] [select]
Re^4: counting the number of 16384 pattern matches in a large DNA sequence by anonym (Acolyte) on Jun 14, 2012 at 20:16 UTC
Thanks. Yes, the patterns are of same length and the search is case sensitive.	[reply]
Re^5: counting the number of 16384 pattern matches in a large DNA sequence by BrowserUk (Patriarch) on Jun 14, 2012 at 20:31 UTC
How long does your current method take? With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. The start of some sanity?	[reply]
Re^3: counting the number of 16384 pattern matches in a large DNA sequence by kennethk (Abbot) on Jun 14, 2012 at 16:47 UTC
Please see How do I post a question effectively? Once you have a discrete test case that you have run through Devel::NYTProf, and as well run test cases with Benchmark for versions both including and excluding study, we can help you understand your results. You may also want to check out index, since you the regex engine is more power than you really need. #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.	[reply]
Re^4: counting the number of 16384 pattern matches in a large DNA sequence by anonym (Acolyte) on Jun 14, 2012 at 17:30 UTC
Is there any other way instead of storing the matches in an array @matches to make the code faster? Thanks	[reply]
Re^5: counting the number of 16384 pattern matches in a large DNA sequence by kennethk (Abbot) on Jun 14, 2012 at 17:56 UTC
I would expect swapping `my @matches = ($seq{$key} =~ /$string/ig); #print "@matches\n"; $result{$string} = scalar(@matches);` [download] to `$result{$string} =()= ($seq{$key} =~ /$string/ig);` would hit the spec of this request, but I would also expect the speed-up achieved would be negligible. The compound of two `=` operators with an empty list (`()`) puts the regex in list context, thus obtaining a count of all matches. #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.	[reply] [d/l] [select]
Re^6: counting the number of 16384 pattern matches in a large DNA sequence by anonym (Acolyte) on Jun 14, 2012 at 20:20 UTC
Re^7: counting the number of 16384 pattern matches in a large DNA sequence by kennethk (Abbot) on Jun 14, 2012 at 22:26 UTC
Re^3: counting the number of 16384 pattern matches in a large DNA sequence by Cristoforo (Curate) on Jun 14, 2012 at 20:12 UTC
One of the inputs is a file of 16384 strings given in below example format: Given a string of 7 fasta characters, you can form 4^7= 16384 unique strings from T, A, G, C. So, his algorithm is checking every possible 7 char fasta string against the mutifasta file (number of lines in the multifasta file is 61913917).	[reply]
Re^4: counting the number of 16384 pattern matches in a large DNA sequence by aaron_baugher (Curate) on Jun 15, 2012 at 05:27 UTC
In that case, it seems like a fairly simple regex should do it. I'd think one regex would be faster than using `index` to check 16384 different substrings, but a benchmark would tell for sure. I'm also not sure why he's reading the entire huge file into a hash; it seems like that could be running him into swap, slowing things down severely. Unless I'm missing something, it seems like this would work (as long as it's not necessary to match overlapping matches): `#!/usr/bin/env perl use Modern::Perl; my %c; while(<DATA>){ chomp; while(/([ACGT]{7})/g){ $c{$1}++; } } say "$_ : $c{$_}" for sort keys %c; __DATA__ NNNAGTACANNNNTAGCNNNNNNAGGTNNNNNAATCCGATNNNNNNTAGGGGGGTTTAAANNNNN NNNAGTCCCACANNNNTAAAAGCNNNNNNAGGTNNNNNAATCCGATNNNNNNTAGGGGGGTTTAAANNNN +N NNNAGTACANNNNTAGCNNNNNNAGGTNNNNNAATCCGATNNNNNNTAGGGGGGTTTAAANNNNN` [download] Aaron B. Available for small or large Perl jobs; see my home node.	[reply] [d/l] [select]
Re^5: counting the number of 16384 pattern matches in a large DNA sequence by BrowserUk (Patriarch) on Jun 15, 2012 at 06:09 UTC
Unless I'm missing something, ... (as long as it's not necessary to match overlapping matches): You hit the nail on the head. You'll only match 5,015,229 times when the OPs code matches 35,106,546 times. However, with a modification to your regex, you can avoid that problem and find overlapping matches: `++$index{ $1 } while $$rSeq =~ m[(?=([ACGT]{7}))]g;` [download] But it is still much slower than avoiding the regex engine completely: `[ 6:55:26.00] C:\test\humanGenome>..\976237 chr21.fa 16384 Using custom indexing found 35106546 matches; took 31.611258 seconds Using custom index2 found 35106546 matches; took 27.504099 seconds Using custom index3 found 35106546 matches; took 27.571143 seconds Using quantified charclass lookahead found 35106546 matches; took 49.9 +54810 seconds` [download] But ++ for thinking outside the box. (I can't believe I actually used that phrase :) With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. The start of some sanity?	[reply] [d/l] [select]
Re^6: counting the number of 16384 pattern matches in a large DNA sequence by salva (Canon) on Jun 15, 2012 at 07:22 UTC
Re^6: counting the number of 16384 pattern matches in a large DNA sequence by aaron_baugher (Curate) on Jun 15, 2012 at 16:28 UTC
Re^7: counting the number of 16384 pattern matches in a large DNA sequence by BrowserUk (Patriarch) on Jun 15, 2012 at 16:37 UTC