You're missing some info. Can the matches overlap, for example? In other words, should 'AGG' count like this:
    'A' => 1, 'G' => 2, 'AG' => 1, 'GG' => 1, 'AGG' => 1
If so, maybe something like:
    use warnings;
    use strict;

    open GENE, '<', '\test.txt' or die("Unable to read: $!");

    my @rolling = (undef, undef, undef);
    my %count;
    my $cnt;

    until (eof GENE) {
        my $char;
        read(GENE,$char,1);
        next unless $char=~/[AGCT]/;        #make sure it's a valid char
        shift @rolling;
        push @rolling, $char;
        next unless defined $rolling[2];
        $count{$rolling[2]}++;              #one-char count
        next unless defined $rolling[1];
        $count{join('',@rolling[1,2])}++;   #two-char count
        next unless defined $rolling[0];
        $count{join('',@rolling)}++;        #three-char count
    }
The hash %count will contain one key for each one-, two-, or three-letter combination found. The value associated with a key is the number of occurrences. If you need to know about combinations that occur zero times, you should pre-initialize %count like:
    my @chars = qw[A C G T];
    for my $aleph (@chars) {            # loop over the letters, not their indices
        $count{$aleph} = 0;
        for my $beth (@chars) {
            $count{$aleph.$beth} = 0;
            for my $gimal (@chars) {
                $count{$aleph.$beth.$gimal} = 0;
            }
        }
    }
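To check the overlap behaviour and the pre-initialization together, here is a minimal sketch that runs the same rolling-window logic over an in-memory string instead of a file. The sub name count_overlapping and the sample input are made up for illustration; they are not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: same rolling-window idea, but fed from a string
# so it can be tried without a data file.
sub count_overlapping {
    my ($seq) = @_;
    my @chars = qw[A C G T];

    # Pre-initialize every one-, two-, and three-letter key to 0.
    my %count;
    for my $aleph (@chars) {
        $count{$aleph} = 0;
        for my $beth (@chars) {
            $count{$aleph.$beth} = 0;
            for my $gimal (@chars) {
                $count{$aleph.$beth.$gimal} = 0;
            }
        }
    }

    my @rolling = (undef, undef, undef);
    for my $char (split //, $seq) {
        next unless $char =~ /[AGCT]/;       # skip newlines and other junk
        shift @rolling;
        push @rolling, $char;
        $count{$rolling[2]}++;               # one-char count
        next unless defined $rolling[1];
        $count{join('', @rolling[1,2])}++;   # two-char count
        next unless defined $rolling[0];
        $count{join('', @rolling)}++;        # three-char count
    }
    return %count;
}

my %count = count_overlapping("AGG\n");
# Print only the combinations that actually occurred.
print "$_ => $count{$_}\n" for grep { $count{$_} } sort keys %count;
```

With the pre-initialization in place the hash always holds 84 keys (4 + 16 + 64), so the grep is what keeps the output down to the combinations actually seen.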
I created a test file of 6 million random chars in the set [AGTC] for performance testing. Results:
60 wallclock secs (59.19 usr + 0.03 sys = 59.22 CPU) @ 0.02/s (n=1)
This is on a 2.4GHz single machine, not doing much else. Is that fast enough?
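If you want to try the timing yourself, something like the following might do. It's only a sketch: it assumes the core Benchmark module, and it uses a much smaller random sample held in memory rather than my 6-million-char test file, so the numbers won't be comparable. The variable names are made up for the example.

```perl
use strict;
use warnings;
use Benchmark qw(timeit timestr);

# Assumption: a small random sample stands in for the real test file.
my $length = 60_000;
my @chars  = qw[A G T C];
my $data   = join '', map { $chars[rand @chars] } 1 .. $length;

my %count;
my $t = timeit(1, sub {
    %count = ();    # start clean on each timed run
    # Read from an in-memory filehandle, one char at a time, as in the loop above.
    open my $fh, '<', \$data or die "Unable to read: $!";
    my @rolling = (undef, undef, undef);
    until (eof $fh) {
        my $char;
        read($fh, $char, 1);
        next unless $char =~ /[AGCT]/;
        shift @rolling;
        push @rolling, $char;
        $count{$rolling[2]}++;               # one-char count
        next unless defined $rolling[1];
        $count{join('', @rolling[1,2])}++;   # two-char count
        next unless defined $rolling[0];
        $count{join('', @rolling)}++;        # three-char count
    }
    close $fh;
});

print timestr($t), "\n";
```

Bump $length up to see how the run time scales on your own machine.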
Update: for maintainability, you could use this chunk instead of all the stuff after the push:
    for ( reverse(0..$#rolling) ) {
        next unless defined $rolling[$_];                  # window not filled this far back yet
        $count{join('', @rolling[$_ .. $#rolling])}++;     # count the last (@rolling - $_) chars
    }
Then, by adding more undefs to the initial @rolling, you can handle longer strings.
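Here's a rough sketch of that generalization, with the window length passed in as a parameter. The sub name count_up_to and its $max_len argument are invented for the example, and it reads from a string rather than a file to keep it short.

```perl
use strict;
use warnings;

# Hypothetical helper: count every overlapping substring of length 1..$max_len.
sub count_up_to {
    my ($seq, $max_len) = @_;
    my @rolling = (undef) x $max_len;    # more undefs = longer strings
    my %count;
    for my $char (split //, $seq) {
        next unless $char =~ /[AGCT]/;   # make sure it's a valid char
        shift @rolling;
        push @rolling, $char;
        for ( reverse(0..$#rolling) ) {
            next unless defined $rolling[$_];               # window not filled this far back yet
            $count{join('', @rolling[$_ .. $#rolling])}++;  # substring ending at the newest char
        }
    }
    return %count;
}

my %count = count_up_to('AGGT', 4);
print "$_ => $count{$_}\n" for sort keys %count;
```

Calling count_up_to($seq, 3) should give the same counts as the original three-slot version.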
In reply to Re: Question about speeding a regexp count
by radiantmatrix
in thread Question about speeding a regexp count
by Commander Salamander