in reply to A better (problem-specific) perl hash?

Using the basic idea that if you pack 2-bits for each character together to form a number, your 12-character words' can be represented by an number 0 .. 2**24, you can use that to index into an array of packed integers each requiring 4 bytes, you arrive at a storage requirement of 2**28 bytes/256 MB to cover your 12-bytes words.

The transforming the words into integers can be done like this, (the quickest of 4 methods I tried):

use List::Util qw[ reduce ]; sub xform{ my $word = shift; $word =~ tr[ACGT][\x00\x01\x02\x03]; reduce{ ($a << 2) + $b } 0, unpack 'C*', $word; }

Then you need an implementation of a packed integer array. I tried a tied array, which has the convenience of standard array notation, $index[ xform( $word ) ]++;, but proves to be rather slow. Converting that to an OO representation was about 20% faster:

package OO::Index; sub new { my $self; open RAM, '>', \$self; seek RAM, 2**28, 0; print RAM chr(0); close RAM; return bless \$self, $_[0]; } sub get { my( $self, $idx ) = @_; return unpack 'V', substr $$self, $idx*4, 4; } sub set { my( $self, $idx, $value ) = @_; substr $$self, $idx*4, 4, pack 'V', $value; return $value; } 1;

A crude benchmark over 1e6 random 12-char keys shows this to be about 40% slower than a hash, but requires only 20% of the memory:

#! perl -sw use strict; package OO::Index; sub new { my $self; open RAM, '>', \$self; seek RAM, 2**28, 0; print RAM chr(0); close RAM; return bless \$self, $_[0]; } sub get { my( $self, $idx ) = @_; return unpack 'V', substr $$self, $idx*4, 4; } sub set { my( $self, $idx, $value ) = @_; substr $$self, $idx*4, 4, pack 'V', $value; return $value; } return 1 if caller; package main; use Benchmark qw[ cmpthese ]; use List::Util qw[ reduce ]; $a = $b; sub rndStr{ join'', @_[ map{ rand @_ } 1 .. shift ] } sub xform{ my $word = shift; $word =~ tr[ACGT][\x00\x01\x02\x03]; reduce{ ($a << 2) + $b } 0, unpack 'C*', $word; } cmpthese 5, { hash => q[ my %index; $index{ rndStr 12, qw[ A C G T ] }++ for 1 .. 1e6; ], oo => q[ my $index = new OO::Index; for ( 1 .. 1e6 ) { my $i = xform rndStr 12, qw[ A C G T ]; $index->set( $i, 1+ $index->get( $i ) ); } ], }; __END__ C:\test>junk s/iter oo hash oo 35.3 -- -58% hash 15.0 136% --

HTH


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^2: A better (problem-specific) perl hash?
by adamk (Chaplain) on Feb 18, 2006 at 15:31 UTC
    What he said!

    Convert each gene string to a number.

    Depending on how much data you have, that isn't a hash.

    It's a (sparse) array.

    As long as your string length isn't too long that gives you a fixed cost storage, for as much data as you like (as long as it doesn't overflow the array element size)
      (as long as it doesn't overflow the array element size)

      With 16 meg combinations of 12 bps, and a 3 billion (-11 :) bps in all, it averages to 179 of each 12-char strings (assuming random data). I used a 4 bytes integer for each counter, so even if the sequence consisted of entirely one character, there is still plenty of headroom.

      From some previous analysis I did on the Drosophila genome, the numbers of repeat sequences are quite low. In it's 160MB, repeats counts for subsequences of length 8 are in the low thousands and by the time you get to 12-chars rarely above a few hundred though you do get some 12-char combinations that run to several thousand repeats, particularly those involving T & A for some reason?

      I coded a version of the above in XS that allows you to specify the element size as 2, 4 or 8 and it will detect attempts to assign values larger. I'm having trouble with conversion between __int64s and doubles for returning to Perl though.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
      that isn't a hash, It's a (sparse) array.

      Actually, it's a Perfect Hash.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.