in reply to Re^3: counting the number of 16384 pattern matches in a large DNA sequence
in thread counting the number of 16384 pattern matches in a large DNA sequence
In that case, a fairly simple regex should do it. I'd expect a single regex pass to be faster than calling index for each of the 16384 (4^7) possible 7-letter substrings, but only a benchmark would tell for sure. I'm also not sure why he's reading the entire huge file into a hash; that could be pushing him into swap and slowing things down severely. Unless I'm missing something, this should work, as long as overlapping matches don't need to be counted:
    #!/usr/bin/env perl
    use Modern::Perl;

    my %c;
    while(<DATA>){
        chomp;
        while(/([ACGT]{7})/g){
            $c{$1}++;
        }
    }
    say "$_ : $c{$_}" for sort keys %c;

    __DATA__
    NNNAGTACANNNNTAGCNNNNNNAGGTNNNNNAATCCGATNNNNNNTAGGGGGGTTTAAANNNNN
    NNNAGTCCCACANNNNTAAAAGCNNNNNNAGGTNNNNNAATCCGATNNNNNNTAGGGGGGTTTAAANNNNN
    NNNAGTACANNNNTAGCNNNNNNAGGTNNNNNAATCCGATNNNNNNTAGGGGGGTTTAAANNNNN
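If overlapping matches do turn out to matter, one possible variation (a sketch, not something I've run against the real data) is to put the capture inside a zero-width lookahead, so the regex engine moves forward one position at a time instead of jumping past each match. The sub names `count_overlapping` and `count_non_overlapping` and the short test sequence are made up for illustration, and like the script above this only looks within a single line, so matches spanning a line break would still be missed.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Count k-letter ACGT words, allowing matches to overlap.
# The capture group sits inside a lookahead, so each match is zero-width
# and the next attempt starts one character further along.
sub count_overlapping {
    my ($seq, $k) = @_;
    my %count;
    $count{$1}++ while $seq =~ /(?=([ACGT]{$k}))/g;
    return \%count;
}

# Same count without overlap: each match consumes its k characters.
sub count_non_overlapping {
    my ($seq, $k) = @_;
    my %count;
    $count{$1}++ while $seq =~ /([ACGT]{$k})/g;
    return \%count;
}

# Hypothetical sequence: a run of 9 valid bases flanked by Ns.
my $seq = 'NNACGTACGTANN';

my $overlap = count_overlapping($seq, 7);
print "$_ : $overlap->{$_}\n" for sort keys %$overlap;
```

On that 9-base run the overlapping version finds three 7-mers (ACGTACG, CGTACGT, GTACGTA), while the non-overlapping version finds only ACGTACG, because the two bases left over are too short to match again.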
Aaron B.
Available for small or large Perl jobs; see my home node.