I'd go with writing a binary file to hold the counts. Every "word" of length n of your sequence can be interpreted as a number in the following way:
Interpret the letters as digits:
my %value = ( G => 0, A => 1, T => 2, C => 3, );
A word @w of length $n then becomes a number by evaluating the following base-4 polynomial:
$number = $value{$w[$n-1]}*(4**($n-1)) + $value{$w[$n-2]}*(4**($n-2)) + ... + $value{$w[0]}
or, more Perlish:
sub word_to_number {
    my (@w) = @_;
    my $res = 0;
    while (@w) {
        $res = $res * 4;
        $res += $value{ pop @w };
    }
    return $res;
}
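As a quick check of the encoding, here is a minimal sketch that converts a word to its number and back. The inverse number_to_word and the @letter lookup table are not part of the post above; they are hypothetical helpers added only to show that the mapping is reversible for a known word length.

```perl
use strict;
use warnings;

# Same digit assignment as in the post.
my %value = ( G => 0, A => 1, T => 2, C => 3 );

# Horner's scheme, consuming the word from its last letter to its first,
# so the first letter ends up as the least significant base-4 digit.
sub word_to_number {
    my (@w) = @_;
    my $res = 0;
    while (@w) {
        $res = $res * 4 + $value{ pop @w };
    }
    return $res;
}

# Hypothetical inverse: rebuild a word of $n letters from its number.
my @letter = qw(G A T C);
sub number_to_word {
    my ($number, $n) = @_;
    my @w;
    for (1 .. $n) {
        push @w, $letter[ $number % 4 ];
        $number = int( $number / 4 );
    }
    return join '', @w;
}

my $number = word_to_number( split //, 'GATC' );
print "$number\n";                          # 228
print number_to_word( $number, 4 ), "\n";   # GATC
```

Note that the word length has to be known to decode, because leading "G" letters are zero digits and do not change the number.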
If you limit yourself to a maximum count of 2**32 - 1, four bytes (one unsigned 32-bit integer) are sufficient to store the count of each word. You can then seek to four times the number of the word and increment the count in the file in place.
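The size of the counts file follows directly from this layout: there are 4**n possible words of length n, each with a 4-byte counter. A small sketch of that arithmetic (the function name counts_file_size and the sample word lengths are chosen here only for illustration):

```perl
use strict;
use warnings;

# Bytes needed for the counts file: one 4-byte counter per possible word.
sub counts_file_size {
    my ($n) = @_;    # word length
    return 4**$n * 4;
}

for my $n ( 4, 8, 12, 16 ) {
    # %.0f avoids integer overflow in printf on perls with 32-bit integers
    printf "n=%2d: %.0f words, %.0f bytes\n", $n, 4**$n, counts_file_size($n);
}
```

For words of length 12 this comes to 64 MiB; each additional letter multiplies the size by four, so the approach is only practical for fairly short words.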
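use Fcntl qw(:seek);
# "+<" opens for read and write without truncating; the file must already exist
open my $frequencies, "+<", 'frequencies.bin' or die "$!";
binmode $frequencies;
...
while (words_are_available()) {
    my @word = split //, get_next_word();
    my $offset = word_to_number(@word);
    seek $frequencies, $offset * 4, SEEK_SET or die "$!";
    # a short read means this counter was never written, so it counts as 0
    my $got = read $frequencies, my $old_count, 4;
    $old_count = ($got && $got == 4) ? unpack("N", $old_count) : 0;
    my $count = $old_count + 1;
    # the read moved the file position, so seek back before writing
    seek $frequencies, $offset * 4, SEEK_SET or die "$!";
    print {$frequencies} pack "N", $count;
};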
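Once the counts are in the file, they can be read back by walking it from start to end and turning each position back into a word. The following is a self-contained sketch under a few assumptions: a word length of 3, a temporary file instead of frequencies.bin, a hard-coded word list in place of get_next_word(), and the helper names count_word and read_counts, none of which come from the post.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);
use File::Temp qw(tempfile);

my $n      = 3;    # assumed word length for the demonstration
my %value  = ( G => 0, A => 1, T => 2, C => 3 );
my @letter = qw(G A T C);

sub word_to_number {
    my $res = 0;
    $res = $res * 4 + $value{$_} for reverse @_;
    return $res;
}

# Increment the 4-byte counter for one word in place.
sub count_word {
    my ( $fh, $word ) = @_;
    my $pos = word_to_number( split //, $word ) * 4;
    seek $fh, $pos, SEEK_SET or die "seek: $!";
    my $got = read $fh, my $buf, 4;
    die "read: $!" unless defined $got;
    # A short read means the counter lies past the end of the file: count 0.
    my $count = $got == 4 ? unpack( "N", $buf ) : 0;
    seek $fh, $pos, SEEK_SET or die "seek: $!";
    print {$fh} pack( "N", $count + 1 ) or die "print: $!";
}

# Walk the file front to back and collect every word seen at least once.
sub read_counts {
    my ( $fh, $n ) = @_;
    my %count;
    seek $fh, 0, SEEK_SET or die "seek: $!";
    for my $number ( 0 .. 4**$n - 1 ) {
        my $got = read $fh, my $buf, 4;
        last unless $got && $got == 4;    # counters past EOF are all zero
        my $c = unpack "N", $buf;
        next unless $c;
        my ( $word, $rest ) = ( '', $number );
        for ( 1 .. $n ) {
            $word .= $letter[ $rest % 4 ];
            $rest = int( $rest / 4 );
        }
        $count{$word} = $c;
    }
    return %count;
}

my ( $fh, $filename ) = tempfile( UNLINK => 1 );
binmode $fh;
count_word( $fh, $_ ) for qw(GAT TCA GAT CCC GAT);
my %count = read_counts( $fh, $n );
print "$_: $count{$_}\n" for sort keys %count;
```

Because the handle is used for both reading and writing, every switch between the two is separated by a seek, which is what Perl's buffered I/O requires.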
In reply to Re: A better (problem-specific) perl hash?
by Corion
in thread A better (problem-specific) perl hash?
by srdst13