Well, since I'm trying to develop my Perl skills for bioinformatics problems such as this one, I thought I'd give it a shot:
use warnings;
use strict;
use Data::Dumper;
my %word_counts = ();
# set for the test code; actual DNA should be parsed into a single-line string with no whitespace
my $DNA = "CGTAGATCCAGTCGA";
my $cur_len = 3; # set current word length to minimum word length
# set maximum word length here to avoid recalculating $DNA length for every iteration
my $max_len = (length $DNA) - 1;
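for (; $cur_len <= $max_len; $cur_len++) { # for each word length
    # again, set to avoid recalculating for every iteration
    my $last_pos = (length $DNA) - $cur_len;
    for (my $pos = 0; $pos <= $last_pos; $pos++) {
        # substr pulls the word out directly, so the string is not
        # re-scanned from the start for each position and there is
        # no unchecked $1 left over from a failed match
        $word_counts{ substr($DNA, $pos, $cur_len) }++;
    }
}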
print Dumper(\%word_counts);
exit;
The bottleneck here is the number of word lengths you search, since every length means another full pass over the sequence. If you need it to run quickly, you could restrict each program run to a fixed range of lengths. At least that's how I'd do it.
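To make the fixed-range idea concrete, here is a minimal sketch. The sub name count_words and its argument order are my own invention for illustration, not anything from a module; it just counts words whose length falls between a minimum and a maximum you pass in, so each run only covers the range you ask for.

```perl
use strict;
use warnings;

# Hypothetical helper (name and signature are assumptions): count every
# substring ("word") of $dna whose length lies between $min_len and
# $max_len inclusive, and return the counts as a hash reference.
sub count_words {
    my ($dna, $min_len, $max_len) = @_;
    my %counts;
    my $dna_len = length $dna;

    # A word cannot be longer than the sequence itself.
    $max_len = $dna_len if $max_len > $dna_len;

    for my $len ($min_len .. $max_len) {
        # Last start position that still leaves room for a full word.
        for my $pos (0 .. $dna_len - $len) {
            $counts{ substr($dna, $pos, $len) }++;
        }
    }
    return \%counts;
}

# Only look at words of length 3 and 4 in this run.
my $counts = count_words("CGTAGATCCAGTCGA", 3, 4);
printf "%s\t%d\n", $_, $counts->{$_} for sort keys %$counts;
```

A later run could then call it with 5 and 8, and so on, merging the results afterwards if you need them all in one place.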
Hope it helps :)
PS: Would any of the fellow monks be kind enough to tell me if there's a way to keep the code tag from breaking and wrapping lines so short?
UPDATE: I just realized that the code treats AtC and ATC as different words, so once you have your DNA sequence in the variable you should also make sure it's all upper or lower case, like:
$DNA = "\U$DNA";
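For what it's worth, the case fix and the "single line, no whitespace" requirement can be handled in one place. This is only a sketch: normalize_dna is a made-up name, and it assumes the raw text contains nothing but sequence characters and whitespace (no header lines or other annotations).

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption): turn raw sequence text
# into a single uppercase string with all whitespace removed.
sub normalize_dna {
    my ($raw) = @_;
    (my $dna = $raw) =~ s/\s+//g;   # drop newlines, spaces and tabs
    return uc $dna;                 # same effect as "\U$dna"
}

my $DNA = normalize_dna("cgtAGat\nccagTCGA\n");
print "$DNA\n";   # prints CGTAGATCCAGTCGA
```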