in reply to A better (problem-specific) perl hash?
I'd go with writing a binary file to hold the counts. Every "word" of length n of your sequence can be interpreted as a number in the following way:
Interpret the letters as digits:
my %value = ( G => 0, A => 1, T => 2, C => 3, );
A word @w then becomes a number by using the following polynomial:
$number = $w[$n]*(4**$n) + $w[$n-1]*(4**($n-1)) + ... + $w[0]
or more perlish
sub word_to_number { my (@w) = @_; my $res = 0; while (@w) { $res = $res * 4; $res += $value{ pop @w }; }; };
If you limit yourself to a maximum frequency of 2**31 (or 2**32), four bytes will be sufficient to store the count of each word. You then can just seek to the number of the word and increment the count in the file.
open my $frequencies, ">+", 'frequencies.bin' or die "$!"; binmode $frequencies; ... while (words_are_available()) { my @word = split //, get_next_word(); my $offset = word_to_number(@word); seek $frequencies, $offset * 4; read $frequencies, my $old_count, 4; $old_count = unpack "N", $old_count; my $count = $old_count + 1; seek $requencies, $offset * 4; write $frequencies, pack "N", $count; };
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re^2: A better (problem-specific) perl hash?
by samtregar (Abbot) on Feb 17, 2006 at 16:03 UTC | |
|
Re^2: A better (problem-specific) perl hash?
by Anonymous Monk on Feb 17, 2006 at 15:24 UTC | |
by Limbic~Region (Chancellor) on Feb 17, 2006 at 19:38 UTC | |
by jonadab (Parson) on Feb 17, 2006 at 15:35 UTC |