in reply to Re: Question about speeding a regexp count
in thread Question about speeding a regexp count

Your solution doesn't quite work. The sum of all the counts in %one, %two and %three should be 600_000, 599_999 and 599_998 respectively, but you end up with 600_000 in all three. Specifically, you end up with one extra substr($sequence, -1) and two extra substr($sequence, -2). (Also, it requires Perl 5.8.0 or higher.)

  • Comment on Re^2: Question about speeding a regexp count

Replies are listed 'Best First'.
Re^3: Question about speeding a regexp count
by BrowserUk (Patriarch) on Oct 14, 2005 at 16:59 UTC

    Yes. It is mentioned as a caveat in the post that where the subsequence length is not an exact multiple of the sequence length, that the short subsequences will need to be removed (as I did in my second attempt at Re^3: Question about speeding a regexp count).

    But then again, it is so slow compared to the other methods that it doesn't really warrent consideration anyway. I did come up with this version:

    sub browser2 { my %count; $count{ A } = $seq =~ tr[A][A]; $count{ C } = $seq =~ tr[C][C]; $count{ G } = $seq =~ tr[G][G]; $count{ T } = $seq =~ tr[T][T]; for( qw'AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT' ) { my $p=0; $count{ $_ }++ while $p = 1+index $seq, $_, $p; } for( qw[ TTT TTG TTC TTA TGT TGG TGC TGA TCT TCG TCC TCA TAT TAG TAC TA +A GTT GTG GTC GTA GGT GGG GGC GGA GCT GCG GCC GCA GAT GAG GAC GA +A CTT CTG CTC CTA CGT CGG CGC CGA CCT CCG CCC CCA CAT CAG CAC CA +A ATT ATG ATC ATA AGT AGG AGC AGA ACT ACG ACC ACA AAT AAG AAC AA +A ] ) { my $p=0; $count{ $_ }++ while $p = 1+index $seq, $_, $p; } 1; }

    Which is a lot quicker, but still much slower than skeeve's and one of sauoq.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
      The caveat doesn't apply because my sequence was a multiple of 2 or 3 characters. Therefore, my bug report stands.
        my bug report stands.

        My (described) two-line bugfix applied. No dramatic difference to the performance of the algorithm

        #! perl -slw use strict; use Time::HiRes qw[ time ]; use List::Util qw[sum];; my $sequence = join'', map{ ( qw[ A C G T ] )[ rand 4 ] } 1 .. 600_000 +; my $start = time; my( %one, %two, %three ); $one{ $_ }++ for unpack '(A1)*', $sequence; $two{ $_ }++ for unpack '(A2X)*', $sequence; delete @two{ (qw'A C G T') };; $three{ $_ }++ for unpack '(A3XX)*', $sequence; delete @three{ (qw'AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT') } +; printf "Elapsed: %3f seconds\n", time - $start; print sum values %one; print sum values %two; print sum values %three; __END__ P:\test>junk Elapsed: 3.390625 seconds 600000 599999 599998

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.