
You're missing some info. Can the matches overlap, for example? In other words, should 'AGG' count like this:

'A'    => 1,
'G'    => 2,
'AG'   => 1,
'GG'   => 1,
'AGG'  => 1,
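
To make the overlapping interpretation concrete, here is a minimal sketch that counts every substring of length 1 to 3 at every start position of a string. The helper name count_overlapping is hypothetical and only serves this illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: count every substring of length 1..$max_len
# at every start position, so matches are allowed to overlap.
sub count_overlapping {
    my ($seq, $max_len) = @_;
    my %count;
    for my $start (0 .. length($seq) - 1) {
        for my $len (1 .. $max_len) {
            last if $start + $len > length $seq;   # would run past the end
            $count{ substr $seq, $start, $len }++;
        }
    }
    return \%count;
}

my $count = count_overlapping('AGG', 3);
print "$_ => $count->{$_}\n" for sort keys %$count;
```

For 'AGG' this yields the five keys listed above, with 'G' counted twice.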

If so, maybe something like:

use warnings; use strict;

open GENE, '<', '\test.txt' or die("Unable to read: $!");
my @rolling = (undef, undef, undef);
my %count;
my $cnt;

until (eof GENE) {
   my $char;
   read(GENE,$char,1);
   next unless $char=~/[AGCT]/; #make sure it's a valid char;

   shift @rolling;
   push @rolling, $char;

   next unless defined $rolling[2];
   $count{$rolling[2]}++;            #one-char count

   next unless defined $rolling[1];
   $count{join('',@rolling[1,2])}++; #two-char count

   next unless defined $rolling[0]; 
   $count{join('',@rolling)}++;      #three-char count
}

The hash %count will contain one key for each one-, two-, or three-letter combination found. The value associated with a key is the number of occurrences. If you need to know about combinations with zero occurrences, you should pre-initialize %count like:

my @chars = qw[A C G T];
for my $aleph (@chars) {              #loop over the letters, so keys are 'A', 'AC', 'ACG', ...
   $count{$aleph}=0;
   for my $beth (@chars) {
      $count{$aleph.$beth}=0;
      for my $gimel (@chars) {
         $count{$aleph.$beth.$gimel}=0;
      }
   }
}
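
Once %count is filled, the results might be reported along these lines. This is only a sketch: the sample counts are assumed (taken from the 'AGG' example), and the sort order is one arbitrary choice.

```perl
use strict;
use warnings;

# Assumed sample data; in practice %count comes from the counting loop.
my %count = (A => 1, G => 2, AG => 1, GG => 1, AGG => 1);

# Shorter keys first, then alphabetical within each length.
my @keys = sort { length($a) <=> length($b) or $a cmp $b } keys %count;

printf "%-3s => %d\n", $_, $count{$_} for @keys;
```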

I created a test file of 6 million random chars in the set [AGTC] for performance testing. Results:

60 wallclock secs (59.19 usr +  0.03 sys = 59.22 CPU) @  0.02/s (n=1)

This is on a single 2.4GHz machine that wasn't doing much else. Is that fast enough?
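
If it isn't, one thing that might be worth trying is reading the file in blocks instead of one character per read() call. The sketch below is untested for speed; the 64KB block size is an assumption, and an in-memory filehandle stands in for the real file so the example runs on its own.

```perl
use strict;
use warnings;

# Demo input via an in-memory filehandle; a real run would open the file.
my $data = "AGG\nTAC\n";
open my $fh, '<', \$data or die "Unable to read: $!";

my @rolling = (undef, undef, undef);
my %count;

# Read a block at a time (block size is an assumption, not a tuned value).
while (read $fh, my $buf, 65536) {
    for my $char ($buf =~ /[AGCT]/g) {    # keep only valid chars
        shift @rolling;
        push @rolling, $char;
        for my $start (reverse 0 .. $#rolling) {
            last unless defined $rolling[$start];   # window not full yet
            $count{ join '', @rolling[$start .. $#rolling] }++;
        }
    }
}
close $fh;
```

Because @rolling lives outside the read loop, combinations that straddle a block boundary are still counted.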

Update: for maintainability, you could use this chunk instead of all the code after the push:

for ( reverse(0..$#rolling) ) {
   last unless defined $rolling[$_];            #undefs only ever sit at the front
   $count{join('',@rolling[$_..$#rolling])}++;  #count the last 1, 2, ... chars
}

Then, by adding more undefs to the initial @rolling, you can handle longer strings.
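
A sketch of that generalization, with the window built from a hypothetical $max_len setting and a short literal string standing in for the file:

```perl
use strict;
use warnings;

my $max_len = 4;                     # assumed longest combination to count
my @rolling = (undef) x $max_len;    # window, oldest char first
my %count;

for my $char (split //, 'AGGT') {    # demo input instead of reading a file
    next unless $char =~ /[AGCT]/;   # make sure it's a valid char
    shift @rolling;
    push @rolling, $char;

    # Count the last 1, 2, ... $max_len chars of the window.
    for my $start (reverse 0 .. $#rolling) {
        last unless defined $rolling[$start];
        $count{ join '', @rolling[$start .. $#rolling] }++;
    }
}

print "$_ => $count{$_}\n" for sort keys %count;
```

Changing $max_len is then the only edit needed to count longer or shorter combinations.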

<-radiant.matrix->
A collection of thoughts and links from the minds of geeks
The Code that can be seen is not the true Code
"In any sufficiently large group of people, most are idiots" - Kaa's Law

In reply to Re: Question about speeding a regexp count by radiantmatrix
in thread Question about speeding a regexp count by Commander Salamander
