Unless I'm missing something, ... (as long as it's not necessary to match overlapping matches):
You hit the nail on the head. You'll only match 5,015,229 times, whereas the OP's code matches 35,106,546 times.
However, with a modification to your regex, you can avoid that problem and find overlapping matches:
++$index{ $1 } while $$rSeq =~ m[(?=([ACGT]{7}))]g;
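To see why the lookahead matters, here is a small sketch on a made-up 11-base string (the sequence and the variable names are mine, purely for illustration). A consuming match eats seven characters per hit, while the zero-width lookahead consumes nothing, so the engine retries at every position:

```perl
use strict;
use warnings;

# Toy sequence (an assumption for the demo): 11 bases, so 11 - 7 + 1 = 5 windows.
my $seq = 'ACGTACGTACG';

# Consuming match: each hit advances pos() by 7, so hits cannot overlap.
my %plain;
++$plain{ $1 } while $seq =~ m[([ACGT]{7})]g;

# Zero-width lookahead: nothing is consumed, so every start position is tried.
my %overlap;
++$overlap{ $1 } while $seq =~ m[(?=([ACGT]{7}))]g;

my $nPlain   = 0; $nPlain   += $_ for values %plain;
my $nOverlap = 0; $nOverlap += $_ for values %overlap;

print "consuming: $nPlain, lookahead: $nOverlap\n";   # consuming: 1, lookahead: 5
```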
But it is still much slower than avoiding the regex engine completely:
[ 6:55:26.00] C:\test\humanGenome>..\976237 chr21.fa 16384
Using custom indexing found 35106546 matches; took 31.611258 seconds
Using custom index2 found 35106546 matches; took 27.504099 seconds
Using custom index3 found 35106546 matches; took 27.571143 seconds
Using quantified charclass lookahead found 35106546 matches; took 49.954810 seconds
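For anyone wondering what "avoiding the regex engine" can look like, here is a minimal sketch of a sliding-window count using substr and tr. This is not the code behind the "custom indexing" timings above; count_kmers and its arguments are hypothetical names, and skipping windows that contain anything other than A, C, G or T is my assumption about the desired behaviour:

```perl
use strict;
use warnings;

# Hypothetical helper: count every overlapping k-mer in the referenced string.
sub count_kmers {
    my( $rSeq, $k ) = @_;
    my %index;
    for my $i ( 0 .. length( $$rSeq ) - $k ) {
        my $kmer = substr $$rSeq, $i, $k;
        # tr with /c counts characters outside ACGT without modifying $kmer.
        next if $kmer =~ tr/ACGT//c;
        ++$index{ $kmer };
    }
    return \%index;
}

my $seq   = 'ACGTNACGTACGT';            # toy input with one ambiguous base
my $index = count_kmers( \$seq, 4 );
printf "%s %d\n", $_, $index->{ $_ } for sort keys %$index;
```

Whether this beats the lookahead regex on a whole chromosome would need to be measured; the sketch only shows the shape of the approach.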
But ++ for thinking outside the box. (I can't believe I actually used that phrase :)
In reply to Re^5: counting the number of 16384 pattern matches in a large DNA sequence
by BrowserUk
in thread counting the number of 16384 pattern matches in a large DNA sequence
by anonym