Re: A better (problem-specific) perl hash?

I'd go with writing a binary file to hold the counts. Every "word" of length n of your sequence can be interpreted as a number in the following way:

Interpret the letters as digits:

my %value = (
  G => 0,
  A => 1,
  T => 2,
  C => 3,
);
[download]

A word @w then becomes a number by using the following polynomial:

$number = $w[$n]*(4**$n) + $w[$n-1]*(4**($n-1)) + ... + $w[0]
[download]

or more perlish

sub word_to_number {
  my (@w) = @_;
  my $res = 0;
  while (@w) {
    $res = $res * 4;
    $res += $value{ pop @w };
  };
};
[download]

If you limit yourself to a maximum frequency of 2**31 (or 2**32), four bytes will be sufficient to store the count of each word. You then can just seek to the number of the word and increment the count in the file.

open my $frequencies, ">+", 'frequencies.bin'
  or die "$!";
binmode $frequencies;

...

while (words_are_available()) {
  my @word = split //, get_next_word();
  my $offset = word_to_number(@word);

  seek $frequencies, $offset * 4;
  read $frequencies, my $old_count, 4;
  $old_count = unpack "N", $old_count;
  my $count = $old_count + 1;
  seek $requencies, $offset * 4;
  write $frequencies, pack "N", $count;
};
[download]

Comment on Re: A better (problem-specific) perl hash? Select or Download Code

Replies are listed 'Best First'.
Re^2: A better (problem-specific) perl hash? by samtregar (Abbot) on Feb 17, 2006 at 16:03 UTC
This is likely to be considerably slower than the in-memory hash version. If memory suffices I'd be inclined to do the same thing with Bit::Vector, perhaps via my ancient wrapper, Tie::IntegerArray. -sam	[reply]
Re^2: A better (problem-specific) perl hash? by Anonymous Monk on Feb 17, 2006 at 15:24 UTC
Is there a paper behind that polynomial?	[reply]
Re^3: A better (problem-specific) perl hash? by Limbic~Region (Chancellor) on Feb 17, 2006 at 19:38 UTC
Anonymous Monk, What Corion didn't explicitly state is that this is just base conversion. Just as in base 16, we let the letters A-F represent 10-15, Corion is letting A, C, T, G stand for values. It is then a fairly trivial math process to convert between bases. Corion shows one way but it isn't the only one. Cheers - L~R	[reply]
Re: A better (problem-specific) perl hash? by jonadab (Parson) on Feb 17, 2006 at 15:35 UTC
Is there a paper behind that polynomial? Paper? You mean like a white paper? No, it's too basic for that. It's standard introductory algebra, the sort of thing you have to be able to do in your sleep before you can take any higher math.	[reply]