in reply to Re^4: counting the number of 16384 pattern matches in a large DNA sequence
in thread counting the number of 16384 pattern matches in a large DNA sequence
Unless I'm missing something, ... (as long as it's not necessary to match overlapping matches):
You hit the nail on the head. Your regex only matches 5,015,229 times, whereas the OP's code matches 35,106,546 times.
However, with a modification to your regex, you can avoid that problem and find overlapping matches:
++$index{ $1 } while $$rSeq =~ m[(?=([ACGT]{7}))]g;
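To make the difference concrete, here is a minimal sketch on a made-up 11-base string (the sample sequence and the variable names are mine, not from the thread). A consuming match steps past each hit, while the zero-width lookahead captures 7 bases without consuming them, so the next attempt starts one character further on:

```perl
use strict;
use warnings;

my $seq = 'ACGTACGTACG';    # hypothetical sample, 11 bases
my ( %plain, %overlap );

# Consuming match: each hit eats 7 bases, so the search resumes after it.
++$plain{ $1 } while $seq =~ m[([ACGT]{7})]g;

# Lookahead: the capture sits inside (?=...), so nothing is consumed and
# the engine only advances one position between matches.
++$overlap{ $1 } while $seq =~ m[(?=([ACGT]{7}))]g;

my ( $plainTotal, $overlapTotal ) = ( 0, 0 );
$plainTotal   += $_ for values %plain;
$overlapTotal += $_ for values %overlap;

print "non-overlapping: $plainTotal\n";
print "overlapping:     $overlapTotal\n";
```

On this sample the lookahead form should see every one of the length - 7 + 1 = 5 windows, while the consuming form sees only the first.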
But it is still much slower than avoiding the regex engine completely:
[ 6:55:26.00] C:\test\humanGenome>..\976237 chr21.fa 16384
Using custom indexing found 35106546 matches; took 31.611258 seconds
Using custom index2 found 35106546 matches; took 27.504099 seconds
Using custom index3 found 35106546 matches; took 27.571143 seconds
Using quantified charclass lookahead found 35106546 matches; took 49.954810 seconds
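The "custom indexing" routines timed above are not shown in this post, so the following is only a rough sketch of what counting without the regex engine might look like, not the code that produced those numbers. The name count_kmers and the tr-based filter for non-ACGT characters are my assumptions:

```perl
use strict;
use warnings;

# Hypothetical helper: slide a window of $k bases along the sequence with
# substr and count each window, skipping any that contain a non-ACGT
# character (such as N).
sub count_kmers {
    my ( $rSeq, $k ) = @_;
    my %index;
    my $last = length( $$rSeq ) - $k;    # last valid start position
    for my $pos ( 0 .. $last ) {
        my $kmer = substr $$rSeq, $pos, $k;
        next if $kmer =~ tr/ACGT//c;     # counts characters outside ACGT
        ++$index{ $kmer };
    }
    return \%index;
}

my $seq   = 'ACGTNACGTACGT';             # made-up sample containing an N
my $index = count_kmers( \$seq, 4 );
print "$_ => $index->{$_}\n" for sort keys %$index;
```

Passing the sequence by reference mirrors the `$$rSeq` usage in the one-liner and avoids copying a chromosome-sized string on every call.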
But ++ for thinking outside the box. (I can't believe I actually used that phrase :)