in reply to Question about speeding a regexp count

Add this to your considerations. It does 600_000 char sequence in 3.39 seconds.

Updated to correct caveat.

#! perl -slw use strict; use List::Util qw[sum]; use Time::HiRes qw[ time ]; my $sequence = join'', map{ ( qw[ A C G T ] )[ rand 4 ] } 1 .. 600_000 +; my $start = time; my( %one, %two, %three ); $one{ $_ }++ for unpack '(A1)*', $sequence; $two{ $_ }++ for unpack '(A2X)*', $sequence; delete @two{ (qw'A C G T') }; $three{ $_ }++ for unpack '(A3XX)*', $sequence; delete @three{ (qw'AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT') } +; printf "Elapsed: %3f seconds\n", time - $start; print sum values %one; print sum values %two; print sum values %three; <STDIN>; print "$_ => $one{ $_ }" for keys %one;; print "$_ => $two{ $_ }" for keys %two;; print "$_ => $three{ $_ }" for keys %three;; __END__ P:\test>499988 Elapsed: 3.390625 seconds 600000 599999 599998

It has one caveat. Where the sequence is not a multiple of 2 or 3 characters, the last 1 or two chars will get counted. This is easily filtered.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.

Replies are listed 'Best First'.
Re^2: Question about speeding a regexp count
by Commander Salamander (Acolyte) on Oct 13, 2005 at 20:04 UTC
    Hi everyone, I'm about to test some of these suggestions, but I just thought I'd add a slight correction to the original posting: My number of bases is actually 6.5 million. Thanks for all the feedback!

      A little over 10x the data takes a little over 10x as long, 36 seconds, though it needed a few tweaks to stop it from blowing gobs of memory:

      #! perl -slw use strict; use Time::HiRes qw[ time ]; my $sequence = join'', map{ ( qw[ A C G T ] )[ rand 4 ] } 1 .. 100_000 +; $sequence x= 65; my $start = time; my( %one, %two, %three ); print 'starting one'; for( my $i=0; $i < length( $sequence ) - 100_000; $i += 100_000 ) { $one{ $_ }++ for unpack '(A1)*', substr $sequence, $i, 100_000; } print 'starting two'; for( my $i=0; $i < length( $sequence ) - 99_999; $i += 99_999 ) { $two{ $_ }++ for unpack '(A2X)*', substr $sequence, $i, 100_000; } delete @two{ (qw'A C G T') };; print 'starting three'; for( my $i=0; $i < length( $sequence ) - 99_999; $i += 99_999 ) { $three{ $_ }++ for unpack '(A3XX)*', substr $sequence, $i, 100_002 +; } delete @three{ (qw'AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT') } +; printf "Elapsed: %3f seconds\n", time - $start; <STDIN>; print "one\n@{[%one]}"; print "two\n@{[%two ]}"; print "three\n@{[%three]}"; __END__ P:\test>499988 starting one starting two starting three Elapsed: 36.257410 seconds one A 1604992 T 1613184 C 1589696 G 1592128 two AC 404168 CC 403645 TG 409301 AT 408133 AA 413074 CT 409560 CG 399941 +TA 409562 GC 400133 CA 401370 GT 407740 AG 404684 TC 406569 GA 406053 TT 412940 +GG 403062 three GCC 97240 AGT 100880 TGT 105820 TGA 99385 CGA 99580 ATC 103155 AAC 102 +765 AGC 100490 TAC 99060 TCG 100490 ACA 96460 CTG 103155 CCG 97500 GCA 986 +70 GTG 100230 AAG 100750 CAC 98995 GTT 105820 AGA 102310 ACC 101660 CCA 1 +03480 TGG 101985 CGC 98475 CTC 98995 TTG 105300 TAA 102700 CAG 100880 ACG 10 +1400 AAA 105365 ATG 100620 GTA 100360 CTT 102570 TAG 104650 GGA 104780 GTC +101335 TGC 102115 TCA 102765 ATT 102765 TAT 103155 AAT 104195 ACT 104650 GAC +103350 CAA 100230 GGT 98865 TCC 104130 TTT 101790 AGG 101010 CGT 102180 CGG 9 +9710 CAT 101270 ATA 101595 CCC 100620 GGG 100360 GAG 98410 TTA 102765 GAT 9 +9515 CTA 104845 TCT 99190 TTC 103090 GCG 100555 GGC 99060 GAA 104780 GCT 10 +3675 CCT 102050

      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
      I just compared my "a1" algorithem with the slowesr ;-) so far for 6_500_000 characters:

      the unpack code took:988 wallclock secs (102.42 usr + 27.64 sys = 130.06 CPU)
      the file code took:30 wallclock secs ( 9.70 usr + 0.24 sys = 9.94 CPU)

      I didn't put the shorter counts in yet because, they would add neglectable time when we create them afterwards. It's at most 3*3*3*2+2=56 additions compared to 6.5 Million increments...

      s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
      +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
Re^2: Question about speeding a regexp count
by ikegami (Patriarch) on Oct 14, 2005 at 16:17 UTC

    Your solution doesn't quite work. The sum of all the counts in %one, %two and %three should be 600_000, 599_999 and 599_998 respectively, but you end up with 600_000 in all three. Specifically, you end up with one extra substr($sequence, -1) and two extra substr($sequence, -2). (Also, it requires Perl 5.8.0 or higher.)

      Yes. It is mentioned as a caveat in the post that where the subsequence length is not an exact multiple of the sequence length, that the short subsequences will need to be removed (as I did in my second attempt at Re^3: Question about speeding a regexp count).

      But then again, it is so slow compared to the other methods that it doesn't really warrent consideration anyway. I did come up with this version:

      sub browser2 { my %count; $count{ A } = $seq =~ tr[A][A]; $count{ C } = $seq =~ tr[C][C]; $count{ G } = $seq =~ tr[G][G]; $count{ T } = $seq =~ tr[T][T]; for( qw'AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT' ) { my $p=0; $count{ $_ }++ while $p = 1+index $seq, $_, $p; } for( qw[ TTT TTG TTC TTA TGT TGG TGC TGA TCT TCG TCC TCA TAT TAG TAC TA +A GTT GTG GTC GTA GGT GGG GGC GGA GCT GCG GCC GCA GAT GAG GAC GA +A CTT CTG CTC CTA CGT CGG CGC CGA CCT CCG CCC CCA CAT CAG CAC CA +A ATT ATG ATC ATA AGT AGG AGC AGA ACT ACG ACC ACA AAT AAG AAC AA +A ] ) { my $p=0; $count{ $_ }++ while $p = 1+index $seq, $_, $p; } 1; }

      Which is a lot quicker, but still much slower than skeeve's and one of sauoq.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
        The caveat doesn't apply because my sequence was a multiple of 2 or 3 characters. Therefore, my bug report stands.