in reply to Question about speeding a regexp count

You're missing some info. Can the matches overlap, for example? In other words, should 'AGG' count like this:

'A' => 1, 'G' => 2, 'AG' => 1, 'GG' => 1, 'AGG' => 1,
?

If so, maybe something like:

    use warnings;
    use strict;

    open GENE, '<', '\test.txt' or die("Unable to read: $!");

    my @rolling = (undef, undef, undef);
    my %count;
    my $cnt;

    until (eof GENE) {
        my $char;
        read(GENE,$char,1);
        next unless $char=~/[AGCT]/;         #make sure it's a valid char
        shift @rolling;
        push @rolling, $char;

        next unless defined $rolling[2];
        $count{$rolling[2]}++;               #one-char count
        next unless defined $rolling[1];
        $count{join('',@rolling[1,2])}++;    #two-char count
        next unless defined $rolling[0];
        $count{join('',@rolling)}++;         #three-char count
    }

The hash %count will contain one key for each one-, two-, or three-letter combination found. The value associated with a key is the number of occurrences. If you also need entries for combinations that occur zero times, you should pre-initialize %count like:

    my @chars = qw[A C G T];
    # loop over the letters themselves (not their indices) so the keys are 'A', 'AC', 'ACG', ...
    for my $aleph (@chars) {
        $count{$aleph}=0;
        for my $beth (@chars) {
            $count{$aleph.$beth}=0;
            for my $gimal (@chars) {
                $count{$aleph.$beth.$gimal}=0;
            }
        }
    }
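
As a quick sanity check of the overlapping interpretation, here is a minimal sketch that runs the same rolling-window logic over an in-memory string instead of a file. The sub name count_overlapping and the string argument are my own assumptions for illustration. For 'AGG' the non-zero entries should match the counts listed at the top of this post.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): the same rolling-window
# counting, but over a string instead of a filehandle.
sub count_overlapping {
    my ($seq) = @_;
    my @rolling = (undef, undef, undef);
    my %count;

    # Pre-initialize every one-, two- and three-letter key to zero.
    my @chars = qw[A C G T];
    for my $x (@chars) {
        $count{$x} = 0;
        for my $y (@chars) {
            $count{$x.$y} = 0;
            for my $z (@chars) {
                $count{$x.$y.$z} = 0;
            }
        }
    }

    for my $char (split //, $seq) {
        next unless $char =~ /[AGCT]/;         # skip anything that is not a base
        shift @rolling;
        push @rolling, $char;
        $count{ $rolling[2] }++;               # one-char count
        next unless defined $rolling[1];
        $count{ join '', @rolling[1,2] }++;    # two-char count
        next unless defined $rolling[0];
        $count{ join '', @rolling }++;         # three-char count
    }
    return \%count;
}

# Print only the combinations that actually occurred.
my $count = count_overlapping('AGG');
for my $key (sort grep { $count->{$_} } keys %$count) {
    print "$key => $count->{$key}\n";
}
```

With the pre-initialization in place, the returned hash holds all 4 + 16 + 64 = 84 keys, most of them zero for an input this short.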

I created a test file of 6 million random chars in the set [AGTC] for performance testing. Results:

60 wallclock secs (59.19 usr + 0.03 sys = 59.22 CPU) @ 0.02/s (n=1)

This is on a 2.4GHz single machine, not doing much else. Is that fast enough?
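
The timing line above is in the format printed by Perl's core Benchmark module. Here is a rough sketch of how a similar measurement could be reproduced. It is not my original test setup: the input size (60,000 characters rather than 6 million), the in-memory filehandle, and the wrapper name count_from_handle are all assumptions made so the sketch is self-contained and finishes quickly.

```perl
use strict;
use warnings;
use Benchmark qw(timeit timestr);

# Assumed test data: 60_000 random characters drawn from [AGCT].
my @alphabet = qw(A G C T);
my $data = join '', map { $alphabet[ rand @alphabet ] } 1 .. 60_000;

# Hypothetical wrapper around the one-character-at-a-time read loop.
sub count_from_handle {
    my ($fh) = @_;
    my @rolling = (undef, undef, undef);
    my %count;
    until (eof $fh) {
        my $char;
        read($fh, $char, 1);
        next unless $char =~ /[AGCT]/;
        shift @rolling;
        push @rolling, $char;
        $count{ $rolling[2] }++;               # one-char count
        next unless defined $rolling[1];
        $count{ join '', @rolling[1,2] }++;    # two-char count
        next unless defined $rolling[0];
        $count{ join '', @rolling }++;         # three-char count
    }
    return \%count;
}

# Time a single pass over the data, reading it through an in-memory handle.
my $result;
my $t = timeit(1, sub {
    open my $fh, '<', \$data or die "Unable to open in-memory handle: $!";
    $result = count_from_handle($fh);
    close $fh;
});
print timestr($t), "\n";
```

Numbers from a run like this will not be comparable to the 60 seconds quoted above, since the input is smaller and never touches the disk.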

Update: for maintainability, you could use this chunk instead of all the stuff after the push:

    for ( reverse(0..$#rolling) ) {
        my $start = $_-@rolling;                  #negative index: -1, -2, -3, ...
        next unless defined $rolling[$start];     #window not filled this far back yet
        $count{join('',@rolling[$start .. -1])}++;
    }

Then, by adding more undefs to the initial @rolling, you can handle longer strings.
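
To make the "more undefs" idea concrete, here is a minimal sketch that takes the maximum length as a parameter. The sub name count_up_to and its arguments are hypothetical, and it works on a string rather than a file for brevity.

```perl
use strict;
use warnings;

# Hypothetical generalization: count every overlapping substring of length
# 1 .. $max_len. The window starts out as $max_len undefs.
sub count_up_to {
    my ($seq, $max_len) = @_;
    my @rolling = (undef) x $max_len;
    my %count;
    for my $char (split //, $seq) {
        next unless $char =~ /[AGCT]/;    # skip anything that is not a base
        shift @rolling;
        push @rolling, $char;
        for my $len (1 .. $max_len) {
            # Stop once the window is not yet filled this far back.
            last unless defined $rolling[-$len];
            $count{ join '', @rolling[-$len .. -1] }++;
        }
    }
    return \%count;
}

my $count = count_up_to('AGGT', 4);
print "$_ => $count->{$_}\n" for sort keys %$count;
```

Each character costs up to $max_len joins and hash updates, so run time grows roughly linearly with the maximum length.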

<-radiant.matrix->
A collection of thoughts and links from the minds of geeks
The Code that can be seen is not the true Code
"In any sufficiently large group of people, most are idiots" - Kaa's Law