Commander Salamander has asked for the wisdom of the Perl Monks concerning the following question:

UPDATE: I tried InjunJoel's approach, and it blazed through all 6.5 million bases very quickly. I now see that using regexps for this type of problem is not the way to go... could someone perhaps explain to me precisely why the regexp approach that I initially tried brought my machine to a screeching halt? I presume it has something to do with the way it stores the string in memory while traversing it with the regexp....

******************************************************
Hello,

I have a massive DNA sequence (a string of ~600,000 (CORRECTION: 6.5 million) "A"s, "G"s, "C"s and "T"s) and I need to count the number of occurrences of every possible 1, 2, and 3 letter combination in this sequence. I'm trying to use the following approach:

while ( $seq =~ /$term/g ) { $count++; }

Unfortunately, this is proving prohibitively slow. I would greatly appreciate any advice on how to speed this up (I tried adding /o, but it made little to no difference).

Also, while I have your ear, I was wondering if anyone could tell me how I can use a variable in a tr// regexp?

Thanks!

Re: Question about speeding a regexp count
by sauoq (Abbot) on Oct 13, 2005 at 18:58 UTC
    I need to count the number of occurrences of every possible 1, 2, and 3 letter combination in this sequence.

    I'd use substr() and build up a hash as I made one pass through the string.

    how I can use a variable in a tr// regexp?

    You must use eval.
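    A minimal sketch of the eval idea (the character lists and variable names here are only placeholders):

    my ($from, $to) = ('ACGT', 'TGCA');              # e.g. complementing a strand
    my $copy    = $seq;
    my $changed = eval "\$copy =~ tr/$from/$to/";    # tr/// doesn't interpolate, so build it at run time
    die $@ if $@;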

    Update: Code for your first question:

    my $length = length $string;
    my %seen;
    for my $i (0 .. $length - 1) {
        $seen{ substr($string, $i, 1) }++;
        $seen{ substr($string, $i, 2) }++ if $i < $length - 1;
        $seen{ substr($string, $i, 3) }++ if $i < $length - 2;
    }
    Assumes your data is in $string.

    -sauoq
    "My two cents aren't worth a dime.";
    

      This is probably faster:

      my %count;
      my $length = length $seq;
      for my $i (0 .. $length - 3) {
          $count{ substr($seq, $i, 1) }++;
          $count{ substr($seq, $i, 2) }++;
          $count{ substr($seq, $i, 3) }++;
      }
      $count{ substr($seq, $length - 1, 1) }++;
      $count{ substr($seq, $length - 2, 1) }++;
      $count{ substr($seq, $length - 2, 2) }++;

      Renamed %seen to the more appropriate %count. Renamed $string to $seq to match the OP.

        This is probably faster:

        Don't forget "less maintainable."

        How many lines do you have to add or change when someone comes around next week and asks you to look at substring lengths up to 10 characters long?

        -sauoq
        "My two cents aren't worth a dime.";
        
Re: Question about speeding a regexp count
by Limbic~Region (Chancellor) on Oct 13, 2005 at 19:25 UTC
    Commander Salamander,
    This trades memory for speed but still ran fine on my 256MB machine with a 600_000 byte DNA strand.
    my $dna = join '', map { chomp; $_ } <DATA>;
    my $template = ('AXA2X2A3X2' x (length($dna) - 2)) . 'AXA2XA';
    my %count;
    $count{$_}++ for unpack $template, $dna;
    print "$_\t$count{$_}\n" for keys %count;
    __DATA__
    AAAAAAAACAAGAATACACAACCACGACTAGAGAAGCAGGAGTATATAATCATGATTCCACAACACCAGCATCCCCACCCCCGCCTCGCGACGCCGGCGT
    CTCTACTCCTGCTTGGAGAAGACGAGGATGCGCAGCCGCGGCTGGGGAGGCGGGGGTGTGTAGTCGTGGTTTTATAATACTAGTATTCTCATCCTCGTCT
    TGTGATGCTGGTGTTTTTATTCTTGTTTAACACAACCACTAGAGCAGTATATAATCCCACACCAGCCCCCCCTCGCGACGGCGTCTCTACTCCTGGGAGA
    CGAGGATGCGCAGCGGCTGGGGAGGGGTGTAGTCTTATACTAGTATTCTCCTCGTCTTGTGATGCTGGACTGGGGTCGATCGTCGAAATCGGCTAGCTAA
    AAAAAAACAAGAATACACAACCACGACTAGAGAAGCAGGAGTATATAATCATGATTCCACAACACCAGCATCCCCACCCCCGCCTCGCGACGCCGGCGTC
    TCTACTCCTGCTTGGAGAAGACGAGGATGCGCAGCCGCGGCTGGGGAGGCGGGGGTGTGTAGTCGTGGTTTTATAATACTAGTATTCTCATCCTCGTCTT
    GTGATGCTGGTGTTTTTATTCTTGTTTAACACAACCACTAGAGCAGTATATAATCCCACACCAGCCCCCCCTCGCGACGGCGTCTCTACTCCTGGGAGAC
    If memory becomes an issue, you could walk the length of the string using substr or unpack. It will require keeping track of your position in the string but both functions provide a means for starting "inside" the string.
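    A minimal sketch of such a walk, tracking the position by hand (the variable names are illustrative, and the trailing one- and two-base tails are skipped for brevity):

    my %count;
    for my $pos ( 0 .. length($dna) - 3 ) {
        # both substr and unpack can start "inside" the string:
        $count{ substr $dna, $pos, 1 }++;
        $count{ substr $dna, $pos, 2 }++;
        $count{ unpack "\@$pos A3", $dna }++;   # '@' moves unpack to an absolute offset
    }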

    Cheers - L~R

    Update: After testing the memory consumption, I updated the post to reflect that it may be realistic to use this approach.

Re: Question about speeding a regexp count
by BrowserUk (Patriarch) on Oct 13, 2005 at 19:50 UTC

    Add this to your considerations. It does a 600_000-char sequence in 3.39 seconds.

    Updated to correct caveat.

    #! perl -slw
    use strict;
    use List::Util qw[ sum ];
    use Time::HiRes qw[ time ];

    my $sequence = join '', map{ ( qw[ A C G T ] )[ rand 4 ] } 1 .. 600_000;

    my $start = time;
    my( %one, %two, %three );

    $one{ $_ }++ for unpack '(A1)*', $sequence;

    $two{ $_ }++ for unpack '(A2X)*', $sequence;
    delete @two{ (qw'A C G T') };

    $three{ $_ }++ for unpack '(A3XX)*', $sequence;
    delete @three{ (qw'AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT') };

    printf "Elapsed: %3f seconds\n", time - $start;

    print sum values %one;
    print sum values %two;
    print sum values %three;

    <STDIN>;

    print "$_ => $one{ $_ }" for keys %one;
    print "$_ => $two{ $_ }" for keys %two;
    print "$_ => $three{ $_ }" for keys %three;

    __END__
    P:\test>499988
    Elapsed: 3.390625 seconds
    600000
    599999
    599998

    It has one caveat: where the sequence length is not a multiple of 2 or 3 characters, the last one or two characters will get counted as short subsequences. These are easily filtered, as shown below.
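    One way to express that filter, equivalent to the delete calls already in the code above (assuming the %two and %three hashes built by the unpack passes):

    delete $two{ $_ }   for grep { length $_ < 2 } keys %two;
    delete $three{ $_ } for grep { length $_ < 3 } keys %three;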


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
      Hi everyone, I'm about to test some of these suggestions, but I just thought I'd add a slight correction to the original posting: My number of bases is actually 6.5 million. Thanks for all the feedback!

        A little over 10x the data takes a little over 10x as long, 36 seconds, though it needed a few tweaks to stop it from blowing gobs of memory:

        #! perl -slw
        use strict;
        use Time::HiRes qw[ time ];

        my $sequence = join '', map{ ( qw[ A C G T ] )[ rand 4 ] } 1 .. 100_000;
        $sequence x= 65;

        my $start = time;
        my( %one, %two, %three );

        print 'starting one';
        for( my $i = 0; $i < length( $sequence ) - 100_000; $i += 100_000 ) {
            $one{ $_ }++ for unpack '(A1)*', substr $sequence, $i, 100_000;
        }

        print 'starting two';
        for( my $i = 0; $i < length( $sequence ) - 99_999; $i += 99_999 ) {
            $two{ $_ }++ for unpack '(A2X)*', substr $sequence, $i, 100_000;
        }
        delete @two{ (qw'A C G T') };

        print 'starting three';
        for( my $i = 0; $i < length( $sequence ) - 99_999; $i += 99_999 ) {
            $three{ $_ }++ for unpack '(A3XX)*', substr $sequence, $i, 100_002;
        }
        delete @three{ (qw'AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT') };

        printf "Elapsed: %3f seconds\n", time - $start;
        <STDIN>;

        print "one\n@{[ %one ]}";
        print "two\n@{[ %two ]}";
        print "three\n@{[ %three ]}";

        __END__
        P:\test>499988
        starting one
        starting two
        starting three
        Elapsed: 36.257410 seconds
        one
        A 1604992 T 1613184 C 1589696 G 1592128
        two
        AC 404168 CC 403645 TG 409301 AT 408133 AA 413074 CT 409560 CG 399941 TA 409562
        GC 400133 CA 401370 GT 407740 AG 404684 TC 406569 GA 406053 TT 412940 GG 403062
        three
        GCC 97240 AGT 100880 TGT 105820 TGA 99385 CGA 99580 ATC 103155 AAC 102765 AGC 100490
        TAC 99060 TCG 100490 ACA 96460 CTG 103155 CCG 97500 GCA 98670 GTG 100230 AAG 100750
        CAC 98995 GTT 105820 AGA 102310 ACC 101660 CCA 103480 TGG 101985 CGC 98475 CTC 98995
        TTG 105300 TAA 102700 CAG 100880 ACG 101400 AAA 105365 ATG 100620 GTA 100360 CTT 102570
        TAG 104650 GGA 104780 GTC 101335 TGC 102115 TCA 102765 ATT 102765 TAT 103155 AAT 104195
        ACT 104650 GAC 103350 CAA 100230 GGT 98865 TCC 104130 TTT 101790 AGG 101010 CGT 102180
        CGG 99710 CAT 101270 ATA 101595 CCC 100620 GGG 100360 GAG 98410 TTA 102765 GAT 99515
        CTA 104845 TCT 99190 TTC 103090 GCG 100555 GGC 99060 GAA 104780 GCT 103675 CCT 102050

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
        I just compared my "a1" algorithm with the slowest ;-) so far for 6_500_000 characters:

        the unpack code took: 988 wallclock secs (102.42 usr + 27.64 sys = 130.06 CPU)
        the file code took: 30 wallclock secs ( 9.70 usr + 0.24 sys = 9.94 CPU)

        I didn't put the shorter counts in yet because they would add negligible time when we create them afterwards. It's at most 3*3*3*2+2=56 additions compared to 6.5 million increments...

        s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
        +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e

      Your solution doesn't quite work. The sum of all the counts in %one, %two and %three should be 600_000, 599_999 and 599_998 respectively, but you end up with 600_000 in all three. Specifically, you end up with one extra substr($sequence, -1) and two extra substr($sequence, -2). (Also, it requires Perl 5.8.0 or higher.)

        Yes. It is mentioned as a caveat in the post that where the sequence length is not an exact multiple of the subsequence length, the short subsequences will need to be removed (as I did in my second attempt at Re^3: Question about speeding a regexp count).

        But then again, it is so slow compared to the other methods that it doesn't really warrant consideration anyway. I did come up with this version:

        sub browser2 {
            my %count;

            $count{ A } = $seq =~ tr[A][A];
            $count{ C } = $seq =~ tr[C][C];
            $count{ G } = $seq =~ tr[G][G];
            $count{ T } = $seq =~ tr[T][T];

            for( qw'AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT' ) {
                my $p = 0;
                $count{ $_ }++ while $p = 1 + index $seq, $_, $p;
            }

            for( qw[
                TTT TTG TTC TTA TGT TGG TGC TGA TCT TCG TCC TCA TAT TAG TAC TAA
                GTT GTG GTC GTA GGT GGG GGC GGA GCT GCG GCC GCA GAT GAG GAC GAA
                CTT CTG CTC CTA CGT CGG CGC CGA CCT CCG CCC CCA CAT CAG CAC CAA
                ATT ATG ATC ATA AGT AGG AGC AGA ACT ACG ACC ACA AAT AAG AAC AAA
            ] ) {
                my $p = 0;
                $count{ $_ }++ while $p = 1 + index $seq, $_, $p;
            }

            1;
        }

        Which is a lot quicker, but still much slower than skeeve's and sauoq's.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
Re: Question about speeding a regexp count
by Skeeve (Parson) on Oct 13, 2005 at 19:40 UTC

    Seems a regex is not the method of choice here...

    Provided my code has no bugs, see for yourself:

    Output:
    the file code took: 2 wallclock secs ( 0.96 usr + 0.04 sys = 1.00 CPU)
    the scalar code took: 3 wallclock secs ( 0.91 usr + 0.03 sys = 0.94 CPU)
    the regex code took: 6 wallclock secs ( 2.54 usr + 0.06 sys = 2.60 CPU)

    Update: ikegami told me my code is incomplete because: "...since it only lists groups of $sequencelength, not of $sequencelength or less."

    That's true, but did it occur to you - and correct me if I'm wrong - that you can get the lower counts without counting each of the sequences?
    Simply loop once for each smaller length over all results, and just count the last few sequences at the end which you didn't hit.

    my (@keys) = keys %DNA_sequence_count;
    foreach my $key (@keys) {
        for ($i = $sequencelength - 1; $i; --$i) {
            $DNA_sequence_count{substr($key, 0, $i)} += $DNA_sequence_count{$key};
        }
    }
    # correction again thanks to ikegami!
    for ($i = $sequencelength - 1; $i; --$i) {
        # second correction thanks to sauoq
        for ($j = $i; $j < $sequencelength; ++$j) {
            ++$DNA_sequence_count{substr($allbuffered, -$j, $i)};
        }
    }
    Small example:
    CATCAT gives
        CAT 2
        ATC 1
        TCA 1
    Applying that algorithm you get
        CA 2 (directly from CAT)
        AT 2 (from ATC + AT at the end)
        TC 1 (directly from TCA)
        C 2
        A 2 (from AT at the end)
        T 2 (from TCA + last T)

    Update: sauoq noticed there must be a bug in that correction algorithm and in fact, he's right. fixed now


    s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
    +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
      Provided my code has no bugs, see for yourself:

      I've noticed persistent differences in the results of your code and mine. (Well, the versions of your code as included in ikegami's benchmarks.) They take the form of 2 to 3 subsequences that differ by a count of 1 or 2. I suspect that this is an easy bug to fix... but since it's your code, I figure you can debug and fix it. ;-)

      That all said, I'd like to point out that 1) including the file reads in your benchmark obscures the issue. As long as the machine has plenty of memory (and maybe yours doesn't) you are contaminating your results with two different methods of reading the data. And 2) you could modify my algorithm and most of the others' to work with chunks as well if RAM really was an issue.
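      For instance, a chunked variant of the one-pass substr() approach might look roughly like this (the file name, chunk size, and variable names are illustrative, not anyone's posted code):

      open my $fh, '<', 'dna.txt' or die "Unable to read: $!";
      my %count;
      my $carry = '';
      while ( read $fh, my $chunk, 100_000 ) {
          $chunk =~ tr/ACGT//cd;                # drop newlines and anything else
          my $buf = $carry . $chunk;
          for my $i ( 0 .. length($buf) - 3 ) {
              $count{ substr($buf, $i, 1) }++;
              $count{ substr($buf, $i, 2) }++;
              $count{ substr($buf, $i, 3) }++;
          }
          $carry = substr $buf, -2;             # carry the last two chars so no window spanning a chunk boundary is lost
      }
      # count the one- and two-letter tails at the very end of the data
      $count{ substr($carry, 0, 1) }++;
      $count{ substr($carry, 0, 2) }++;
      $count{ substr($carry, 1, 1) }++;
      close $fh;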

      If you work your bugs out, I am interested in your method of iterating over the shorter strings for the smaller length matches though. I haven't looked at it closely yet, but I'd like to see how that scales.

      -sauoq
      "My two cents aren't worth a dime.";
      

        Well spotted, sauoq! There are indeed bugs:

        1. There was a bug in the implementation of skeeve2
        2. There was a bug in my correction algorithm
        Fixing them does not significantly increase calculation time. For those who are interested, here is the code to fix ikegami's benchmark:

        I'll fix my algorithm outlined in my other post.

        Regarding your other issues:

        1. including the file reads in your benchmark obscures the issue. As long as the machine has plenty of memory (and maybe yours doesn't) you are contaminating your results with two different methods of reading the data.

          I don't agree. It was intended that way.

          Reading chunks of data is important if you are low on memory. And I'm 100% sure the OP has to read his data from a file. So reading data is something that is essential for the algorithm.

        2. you could modify my algorithm and most of the others' to work with chunks as well if RAM really was an issue.

          Here I do agree. Of course it can be done. But you didn't ;-)

          I challenge you to do so and then let's compare times.

          But OTOH: my algorithm taking low memory into account is the same as a2, the one ikegami implemented as "skeeve2". So the only difference should be the time needed for reading the data. And if we all do so, and if we all read the same chunk size, there shouldn't be a difference in the ranking.

          Of course: All of you could rewrite and use my correction algorithm. This would make some of your algorithms significantly faster, I think.

        s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
        +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
Re: Question about speeding a regexp count
by ikegami (Patriarch) on Oct 13, 2005 at 20:20 UTC

    Benchmarks:

               s/iter  limbic  injun  radiant  ikegami  radiant_i  sauoq  skeeve3  sauoq_i  skeeve3_i  skeeve2
    limbic       3.35      --    -1%     -34%     -47%       -56%   -71%     -77%     -77%       -81%     -91%
    injun        3.33      1%     --     -34%     -46%       -56%   -71%     -77%     -77%       -81%     -91%
    radiant      2.21     52%    51%       --     -19%       -33%   -56%     -65%     -66%       -71%     -87%
    ikegami      1.79     87%    86%      23%       --       -18%   -46%     -57%     -58%       -64%     -84%
    radiant_i    1.47    128%   126%      50%      22%         --   -34%     -48%     -48%       -56%     -80%
    sauoq       0.969    246%   244%     128%      85%        52%     --     -21%     -22%       -34%     -70%
    skeeve3     0.769    336%   333%     187%     133%        91%    26%       --      -1%       -17%     -62%
    sauoq_i     0.759    342%   339%     191%     136%        94%    28%       1%       --       -16%     -61%
    skeeve3_i   0.641    423%   420%     244%     180%       130%    51%      20%      19%        --      -54%
    skeeve2     0.294   1042%  1035%     651%     511%       401%   230%     162%     159%       118%       --

    Benchmarks code:

    I edited the solutions to use the same style and to work from memory. I removed any extra work which could be easily added to any solution (such as validation checks).

    I thoroughly checked all the solutions. All of them produce identical results.

    I only ran the tests for 5 iterations because it's slow, but I ran it many times. All results were statistically identical.

    The solutions by BrowserUK, InjunJoel and particularly Limbic~Region use much more memory than the others. All solutions expect the data to be fully in memory.

    BrowserUK's solution requires Perl 5.8+.

    Update: Reran on 5.8 to time BrowserUK.

    Update: Added Skeeve's solutions, now that they are complete.

    Update: Added RadiantMatrix's solution.

    Update: Added Skeeve's bugfixes. Thoroughly verified the results of every test. Removed BrowserUK's solution since it was buggy. Applied hv's micro-optimizations.

      You can squeeze out some micro-optimisations. In general, ++$x is slightly faster than $x++; playing with skeeve3_i, I also found that /.../ was faster than /.{3}/ (to an extent that surprised me).

                 Rate  l3epost  l3epre  l3ipost  l3ipre
      l3epost  42.2/s       --     -3%      -8%    -10%
      l3epre   43.4/s       3%      --      -5%     -8%
      l3ipost  45.8/s       9%      6%       --     -3%
      l3ipre   47.1/s      12%      9%       3%      --
      is the output from this code:
      use strict;
      use Benchmark qw/ cmpthese /;

      our $a = "abcdefghij" x 1000;

      cmpthese(-1, {
          l3epost => q{ my %a; $a{$1}++ while $a =~ /(?=(.{3}))/g },
          l3epre  => q{ my %a; ++$a{$1} while $a =~ /(?=(.{3}))/g },
          l3ipost => q{ my %a; $a{$1}++ while $a =~ /(?=(...))/g },
          l3ipre  => q{ my %a; ++$a{$1} while $a =~ /(?=(...))/g },
      });

      It shouldn't be hard to improve perl to compile $x++ to a preincrement in void context; fixing the regexp engine to make an explicit fixed count as fast as unrolling is likely to be harder.

      Update: Now I'm confused: I checked the source, and void postincrement should be getting replaced with preincrement at compile time; I definitely don't understand now why explicit preinc is showing consistently faster times.

      Hugo

      Very interesting. Looks like I should have gone with sauoq :). Thanks to everyone for their assistance!

      I'd love to see results with 6_500_000 compared to my code "a1", which is reading 1kb chunks of data from disk. ;-)


      s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
      +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
Re: Question about speeding a regexp count
by injunjoel (Priest) on Oct 13, 2005 at 19:30 UTC
    Greetings all,
    I would use a hash and a for loop...
    #!/usr/bin/perl -w
    use strict;
    use Dumpvalue;

    my $d = new Dumpvalue;
    my %count;
    my $DNAstr = <DATA>;
    chomp($DNAstr);
    my @bases = split //, $DNAstr;
    for(my $i = 0; $i < scalar(@bases); $i++){
        $count{$bases[$i]}++;
        $count{$bases[$i].$bases[$i+1]}++ if(defined $bases[$i+1]);
        $count{$bases[$i].$bases[$i+1].$bases[$i+2]}++ if(defined $bases[$i+2]);
    }
    $d->dumpValues(\%count);

    Hope that helps

    -InjunJoel
    "I do not feel obliged to believe that the same God who endowed us with sense, reason and intellect has intended us to forego their use." -Galileo
Re: Question about speeding a regexp count
by radiantmatrix (Parson) on Oct 13, 2005 at 21:28 UTC

    You're missing some info. Can the matches overlap, for example? In other words, should 'AGG' count like this:

    'A' => 1, 'G' => 2, 'AG' => 1, 'GG' => 1, 'AGG' => 1,
    ?

    If so, maybe something like:

    use warnings;
    use strict;

    open GENE, '<', '\test.txt' or die("Unable to read: $!");

    my @rolling = (undef, undef, undef);
    my %count;
    my $cnt;

    until (eof GENE) {
        my $char;
        read(GENE, $char, 1);
        next unless $char =~ /[AGCT]/;   # make sure it's a valid char
        shift @rolling;
        push @rolling, $char;

        next unless defined $rolling[2];
        $count{$rolling[2]}++;                  # one-char count
        next unless defined $rolling[1];
        $count{join('', @rolling[1,2])}++;      # two-char count
        next unless defined $rolling[0];
        $count{join('', @rolling)}++;           # three-char count
    }

    The hash %count will contain one key for each one-, two-, or three-letter combination found. The value associated with a key is the count of occurrences. If you need to know about '0' occurrences, you should pre-initialize %count like:

    my @chars = qw[A C G T];
    for my $aleph (@chars) {
        $count{$aleph} = 0;
        for my $beth (@chars) {
            $count{$aleph.$beth} = 0;
            for my $gimal (@chars) {
                $count{$aleph.$beth.$gimal} = 0;
            }
        }
    }

    I created a test file of 6 million random chars in the set [AGTC] for performance testing. Results:

    60 wallclock secs (59.19 usr + 0.03 sys = 59.22 CPU) @ 0.02/s (n=1)

    This is on a 2.4GHz single machine, not doing much else. Is that fast enough?

    Update: for maintainability, you could use this chunk instead of all the stuff after the push:

    for ( reverse(0..$#rolling) ) {
        next unless defined $rolling[$_ - @rolling];
        $count{ join('', @rolling[$_ - @rolling .. -1]) }++;
    }

    Then, by adding more undefs to the initial @rolling, you can handle longer strings.

    <-radiant.matrix->
    A collection of thoughts and links from the minds of geeks
    The Code that can be seen is not the true Code
    "In any sufficiently large group of people, most are idiots" - Kaa's Law
Re: Question about speeding a regexp count
by Roy Johnson (Monsignor) on Oct 14, 2005 at 14:50 UTC
    You can use a regexp, but you ought to make it work in one pass. It's a little tricky, but it benches favorably with ikegami's solution.
    sub roy {
        my %count;
        while ($seq =~ /(?=(((.).).))/g) {
            ++$count{$1};
            ++$count{$2};
            ++$count{$3};
        }
        # pick up the last two chars
        $seq =~ /((.)(.))$/;
        ++$count{$1};
        ++$count{$2};
        ++$count{$3};
        1;
    }
    $seq should be at least 3 chars long.

    Caution: Contents may have been coded under pressure.
Re: Question about speeding a regexp count
by scmason (Monk) on Oct 13, 2005 at 19:11 UTC
    I like the previous post; it is a solid method.

    Alternatively, you could unpack the string and simply iterate through it. A bit more code, but perhaps fast enough.
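    A minimal sketch of that idea (illustrative only, not scmason's actual code):

    my @bases = unpack 'A1' x length($seq), $seq;   # one array element per base
    my %count;
    for my $i ( 0 .. $#bases ) {
        $count{ $bases[$i] }++;
        $count{ $bases[$i] . $bases[$i+1] }++                  if $i + 1 <= $#bases;
        $count{ $bases[$i] . $bases[$i+1] . $bases[$i+2] }++   if $i + 2 <= $#bases;
    }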

    "Never take yourself too seriously, because everyone knows that fat birds dont fly" -FLC
Re: Question about speeding a regexp count
by ikegami (Patriarch) on Oct 13, 2005 at 19:37 UTC

    Another possibility:

    $count{$1}++ while $seq =~ /(?=(.{1}))/g;
    $count{$1}++ while $seq =~ /(?=(.{2}))/g;
    $count{$1}++ while $seq =~ /(?=(.{3}))/g;

    Update: whoops, added the (?=) I initially forgot.

      This approach actually does not work, because the regex starts matching after the previous match. In other words, if $seq is "foobar" it will find 'foo' and then 'bar' but not 'oob' or 'oba'.
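      A quick illustration of the difference:

      my $seq = 'foobar';
      my ( %plain, %overlap );
      $plain{$1}++   while $seq =~ /(.{3})/g;       # finds 'foo' and 'bar' only
      $overlap{$1}++ while $seq =~ /(?=(.{3}))/g;   # finds 'foo', 'oob', 'oba', 'bar'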
        oops, I knew that! Thanks, Fixed.