Hi everyone,
I'm about to test some of these suggestions, but I just thought I'd add a slight correction to the original posting: My number of bases is actually 6.5 million.
Thanks for all the feedback! | [reply] |
#! perl -slw
use strict;
use Time::HiRes qw[ time ];
my $sequence = join'', map{ ( qw[ A C G T ] )[ rand 4 ] } 1 .. 100_000
+;
$sequence x= 65;
my $start = time;
my( %one, %two, %three );
print 'starting one';
for( my $i=0; $i < length( $sequence ) - 100_000; $i += 100_000 ) {
$one{ $_ }++ for unpack '(A1)*', substr $sequence, $i, 100_000;
}
print 'starting two';
for( my $i=0; $i < length( $sequence ) - 99_999; $i += 99_999 ) {
$two{ $_ }++ for unpack '(A2X)*', substr $sequence, $i, 100_000;
}
delete @two{ (qw'A C G T') };;
print 'starting three';
for( my $i=0; $i < length( $sequence ) - 99_999; $i += 99_999 ) {
$three{ $_ }++ for unpack '(A3XX)*', substr $sequence, $i, 100_002
+;
}
delete @three{ (qw'AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT') }
+;
printf "Elapsed: %3f seconds\n", time - $start;
<STDIN>;
print "one\n@{[%one]}";
print "two\n@{[%two ]}";
print "three\n@{[%three]}";
__END__
P:\test>499988
starting one
starting two
starting three
Elapsed: 36.257410 seconds
one
A 1604992 T 1613184 C 1589696 G 1592128
two
AC 404168 CC 403645 TG 409301 AT 408133 AA 413074 CT 409560 CG 399941
+TA 409562
GC 400133 CA 401370 GT 407740 AG 404684 TC 406569 GA 406053 TT 412940
+GG 403062
three
GCC 97240 AGT 100880 TGT 105820 TGA 99385 CGA 99580 ATC 103155 AAC 102
+765
AGC 100490 TAC 99060 TCG 100490 ACA 96460 CTG 103155 CCG 97500 GCA 986
+70
GTG 100230 AAG 100750 CAC 98995 GTT 105820 AGA 102310 ACC 101660 CCA 1
+03480
TGG 101985 CGC 98475 CTC 98995 TTG 105300 TAA 102700 CAG 100880 ACG 10
+1400
AAA 105365 ATG 100620 GTA 100360 CTT 102570 TAG 104650 GGA 104780 GTC
+101335
TGC 102115 TCA 102765 ATT 102765 TAT 103155 AAT 104195 ACT 104650 GAC
+103350
CAA 100230 GGT 98865 TCC 104130 TTT 101790 AGG 101010 CGT 102180 CGG 9
+9710
CAT 101270 ATA 101595 CCC 100620 GGG 100360 GAG 98410 TTA 102765 GAT 9
+9515
CTA 104845 TCT 99190 TTC 103090 GCG 100555 GGC 99060 GAA 104780 GCT 10
+3675
CCT 102050
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
| [reply] [d/l] |
| [reply] [d/l] [select] |
Your solution doesn't quite work. The sum of all the counts in %one, %two and %three should be 600_000, 599_999 and 599_998 respectively, but you end up with 600_000 in all three. Specifically, you end up with one extra substr($sequence, -1) and two extra substr($sequence, -2). (Also, it requires Perl 5.8.0 or higher.)
| [reply] |
Yes. It is mentioned as a caveat in the post that where the subsequence length is not an exact multiple of the sequence length, that the short subsequences will need to be removed (as I did in my second attempt at Re^3: Question about speeding a regexp count).
But then again, it is so slow compared to the other methods that it doesn't really warrent consideration anyway. I did come up with this version:
sub browser2 {
my %count;
$count{ A } = $seq =~ tr[A][A];
$count{ C } = $seq =~ tr[C][C];
$count{ G } = $seq =~ tr[G][G];
$count{ T } = $seq =~ tr[T][T];
for( qw'AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT' ) {
my $p=0;
$count{ $_ }++ while $p = 1+index $seq, $_, $p;
}
for( qw[
TTT TTG TTC TTA TGT TGG TGC TGA TCT TCG TCC TCA TAT TAG TAC TA
+A
GTT GTG GTC GTA GGT GGG GGC GGA GCT GCG GCC GCA GAT GAG GAC GA
+A
CTT CTG CTC CTA CGT CGG CGC CGA CCT CCG CCC CCA CAT CAG CAC CA
+A
ATT ATG ATC ATA AGT AGG AGC AGA ACT ACG ACC ACA AAT AAG AAC AA
+A
] ) {
my $p=0;
$count{ $_ }++ while $p = 1+index $seq, $_, $p;
}
1;
}
Which is a lot quicker, but still much slower than skeeve's and one of sauoq.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
| [reply] [d/l] |
The caveat doesn't apply because my sequence was a multiple of 2 or 3 characters. Therefore, my bug report stands.
| [reply] |