Re: A better (problem-specific) perl hash?

Using the basic idea that if you pack 2-bits for each character together to form a number, your 12-character words' can be represented by an number 0 .. 2**24, you can use that to index into an array of packed integers each requiring 4 bytes, you arrive at a storage requirement of 2**28 bytes/256 MB to cover your 12-bytes words.

The transforming the words into integers can be done like this, (the quickest of 4 methods I tried):

use List::Util qw[ reduce ];
sub xform{
    my $word = shift;
    $word =~ tr[ACGT][\x00\x01\x02\x03];
    reduce{ ($a << 2) + $b } 0, unpack 'C*', $word;
}
[download]

Then you need an implementation of a packed integer array. I tried a tied array, which has the convenience of standard array notation, $index[ xform( $word ) ]++;, but proves to be rather slow. Converting that to an OO representation was about 20% faster:

package OO::Index;

sub new {
    my $self;
    open RAM, '>', \$self;
    seek RAM, 2**28, 0;
    print RAM chr(0);
    close RAM;
    return bless \$self, $_[0];
}

sub get {
    my( $self, $idx ) = @_;
    return unpack 'V', substr $$self, $idx*4, 4;
}

sub set {
    my( $self, $idx, $value ) = @_;
    substr $$self, $idx*4, 4, pack 'V', $value;
    return $value;
}
1;
[download]

A crude benchmark over 1e6 random 12-char keys shows this to be about 40% slower than a hash, but requires only 20% of the memory:

#! perl -sw
use strict;

package OO::Index;

sub new {
    my $self;
    open RAM, '>', \$self;
    seek RAM, 2**28, 0;
    print RAM chr(0);
    close RAM;
    return bless \$self, $_[0];
}

sub get {
    my( $self, $idx ) = @_;
    return unpack 'V', substr $$self, $idx*4, 4;
}

sub set {
    my( $self, $idx, $value ) = @_;
    substr $$self, $idx*4, 4, pack 'V', $value;
    return $value;
}

return 1 if caller;
package main;
use Benchmark qw[ cmpthese ];
use List::Util qw[ reduce ]; $a = $b;

sub rndStr{ join'', @_[ map{ rand @_ } 1 .. shift ] }

sub xform{
    my $word = shift;
    $word =~ tr[ACGT][\x00\x01\x02\x03];
    reduce{ ($a << 2) + $b } 0, unpack 'C*', $word;
}


cmpthese 5, {
    hash => q[
        my %index;
        $index{ rndStr 12, qw[ A C G T ] }++ for 1 .. 1e6;
    ],
    oo => q[
        my $index = new OO::Index;
        for ( 1 .. 1e6 ) {
            my $i = xform rndStr 12, qw[ A C G T ];
            $index->set( $i, 1+ $index->get( $i ) );
        }
    ],
};

__END__
C:\test>junk
     s/iter   oo hash
oo     35.3   -- -58%
hash   15.0 136%   --
[download]

HTH

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

Comment on Re: A better (problem-specific) perl hash? Select or Download Code

Replies are listed 'Best First'.
Re^2: A better (problem-specific) perl hash? by adamk (Chaplain) on Feb 18, 2006 at 15:31 UTC
What he said! Convert each gene string to a number. Depending on how much data you have, that isn't a hash. It's a (sparse) array. As long as your string length isn't too long that gives you a fixed cost storage, for as much data as you like (as long as it doesn't overflow the array element size)	[reply]
Re^3: A better (problem-specific) perl hash? by BrowserUk (Patriarch) on Feb 18, 2006 at 16:23 UTC
(as long as it doesn't overflow the array element size) With 16 meg combinations of 12 bps, and a 3 billion (-11 :) bps in all, it averages to 179 of each 12-char strings (assuming random data). I used a 4 bytes integer for each counter, so even if the sequence consisted of entirely one character, there is still plenty of headroom. From some previous analysis I did on the Drosophila genome, the numbers of repeat sequences are quite low. In it's 160MB, repeats counts for subsequences of length 8 are in the low thousands and by the time you get to 12-chars rarely above a few hundred though you do get some 12-char combinations that run to several thousand repeats, particularly those involving T & A for some reason? I coded a version of the above in XS that allows you to specify the element size as 2, 4 or 8 and it will detect attempts to assign values larger. I'm having trouble with conversion between __int64s and doubles for returning to Perl though. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal? "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply]
Re^3: A better (problem-specific) perl hash? by BrowserUk (Patriarch) on Feb 18, 2006 at 17:31 UTC
that isn't a hash, It's a (sparse) array. Actually, it's a Perfect Hash. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal? "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l]