Re: Exact string matching

Well, since I'm trying to develop my Perl skills for bioinformatics problems such as this one, I thought I'd give it a shot:

use warnings;
use strict;
use Data::Dumper; 


my %word_counts = ();
my $DNA = "CGTAGATCCAGTCGA"; # set for the test code, actual dna should be parsed into a single line string with no whitespace
my $cur_len = 3; #set current word length to minimum word length
my $max_len = (length $DNA) -1; #set maximum word length, set here to avoid recalculating $DNA length for every iteration

for (;$cur_len <= $max_len; $cur_len++){ #for each word length
   my $last_pos = (length $DNA) -$cur_len; #again, set to avoid recalculating for every iteration
   for (my $pos = 0; $pos <= $last_pos; $pos++){
      $DNA =~ m/^.{$pos}(.{$cur_len})/;
      $word_counts{$1}++;
   }
}
print Dumper(\%word_counts);

exit;

The bottleneck here would be the number of word lengths you search. If you need it to run quickly, you could try splitting that into fixed ranges spread over multiple program runs. Or at least that's how I'd do it.
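
A minimal sketch of that fixed-range idea, assuming a hypothetical helper named count_words that takes the sequence plus a minimum and a maximum word length. It swaps the regex from the code above for substr, which pulls out the same substrings without rescanning from the start of the string on every match. Treat it as an illustration under those assumptions, not a tuned implementation.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original code): count every word
# whose length falls between $min_len and $max_len, inclusive.
sub count_words {
    my ($dna, $min_len, $max_len) = @_;
    my $dna_len = length $dna;
    my %counts;
    for my $len ($min_len .. $max_len) {
        last if $len > $dna_len;    # no word of this length fits
        for my $pos (0 .. $dna_len - $len) {
            $counts{ substr($dna, $pos, $len) }++;
        }
    }
    return \%counts;
}

# One run per range: lengths 3..5 here, 6..8 in a later run, and so on.
my $counts = count_words("CGTAGATCCAGTCGA", 3, 5);
printf "%s => %d\n", $_, $counts->{$_} for sort keys %$counts;
```

Each run then only pays for the word lengths in its own range, and the per-range hashes can be merged afterwards if a combined table is needed.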

Hope it helps :)

PS: Would any of the fellow monks be kind enough to tell me if there's a way to stop the code tag from breaking and wrapping lines so early?

UPDATE: Just realized that the code would probably consider AtC and ATC different words, so when you get your DNA sequence into the variable you should also make sure it's all upper case or all lower case, like:
$DNA = "\U$DNA";
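
Both clean-up steps mentioned in this post (no whitespace, one consistent case) can be sketched together. The name normalize_dna is a hypothetical one made up for illustration, and the input string is invented test data; uc does the same job as the "\U$DNA" interpolation above.

```perl
use strict;
use warnings;

# Hypothetical helper: strip all whitespace (newlines included) and
# upper-case the sequence so that AtC and ATC count as the same word.
sub normalize_dna {
    my ($raw) = @_;
    (my $dna = $raw) =~ s/\s+//g;    # copy first, then edit the copy
    return uc $dna;
}

my $DNA = normalize_dna("cgtAGA tcc\nagtcga\n");
print "$DNA\n";    # prints CGTAGATCCAGTCGA
```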
