
The basic idea: if you pack 2 bits per character into a single number, each of your 12-character words can be represented by an integer in the range 0 .. 2**24-1. You can use that integer to index into an array of packed integers, each requiring 4 bytes. Strictly, 2**24 four-byte counters need 2**26 bytes (64 MB); the code below allocates 2**28 bytes (256 MB), which is more than enough to cover all of your 12-character words.

Transforming the words into integers can be done like this (the quickest of the 4 methods I tried):

use List::Util qw[ reduce ];
sub xform{
    my $word = shift;
    $word =~ tr[ACGT][\x00\x01\x02\x03];
    reduce{ ($a << 2) + $b } 0, unpack 'C*', $word;
}
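To make the mapping concrete, here is a small sketch that runs the same transform on a few words and converts one back. The `unxform` helper is hypothetical (it is not part of the code above) and assumes the word length is known in advance.

```perl
use strict;
use warnings;
use List::Util qw[ reduce ];

# Same transform as above: 2 bits per base, first character most significant.
sub xform {
    my $word = shift;
    $word =~ tr[ACGT][\x00\x01\x02\x03];
    return reduce { ( $a << 2 ) + $b } 0, unpack 'C*', $word;
}

# Hypothetical inverse: peel off 2 bits at a time, lowest bits first.
sub unxform {
    my( $n, $len ) = @_;
    my $word = '';
    for ( 1 .. $len ) {
        $word = substr( 'ACGT', $n & 3, 1 ) . $word;
        $n >>= 2;
    }
    return $word;
}

# The all-A word maps to 0 and the all-T word to 2**24 - 1.
printf "%-12s => %d\n", $_, xform( $_ )
    for qw[ AAAAAAAAAAAA AAAAAAAAAAAC TTTTTTTTTTTT ];

# Round trip through both functions.
print unxform( xform( 'GATTACAGATTA' ), 12 ), "\n";
```

Every 12-character word over ACGT lands on a distinct integer below 2**24, which is what makes the integer usable as a direct array index.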

Then you need an implementation of a packed integer array. I tried a tied array, which has the convenience of standard array notation ($index[ xform( $word ) ]++;), but it proved to be rather slow. Converting that to an OO representation was about 20% faster:

package OO::Index;

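sub new {
    my $self;
    # Open an in-memory filehandle on $self, then seek and write one byte
    # so the underlying string is extended (zero-filled) in a single step.
    open RAM, '>', \$self;
    seek RAM, 2**28, 0;
    print RAM chr(0);
    close RAM;
    return bless \$self, $_[0];
}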

sub get {
    my( $self, $idx ) = @_;
    return unpack 'V', substr $$self, $idx*4, 4;
}

sub set {
    my( $self, $idx, $value ) = @_;
    substr $$self, $idx*4, 4, pack 'V', $value;
    return $value;
}
1;
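For comparison, the tied-array variant mentioned above could look roughly like this. This is a minimal sketch, not the code that was actually timed: the package name `Tie::PackedU32` is made up, only the methods needed for element access are implemented, and the demo uses a small array rather than the full 2**24 slots.

```perl
package Tie::PackedU32;    # hypothetical name, for illustration only
use strict;
use warnings;

# Back the array with one string of zeroed 4-byte little-endian slots.
sub TIEARRAY {
    my( $class, $size ) = @_;
    my $buf = "\0" x ( $size * 4 );
    return bless \$buf, $class;
}

# Element access: the same substr/pack/unpack scheme as the OO version.
sub FETCH     { unpack 'V', substr ${ $_[0] }, $_[1] * 4, 4 }
sub STORE     { substr( ${ $_[0] }, $_[1] * 4, 4 ) = pack 'V', $_[2] }
sub FETCHSIZE { length( ${ $_[0] } ) / 4 }

package main;

tie my @index, 'Tie::PackedU32', 2**10;    # small size for the demo

# Standard array notation; each ++ triggers one FETCH and one STORE.
$index[ 42 ]++ for 1 .. 3;
print join( ' ', $index[ 42 ], $index[ 43 ] ), "\n";
```

The notation is nicer, but every element access goes through the tie dispatch on top of the method body, which is a plausible reason for the slowdown.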

A crude benchmark over 1e6 random 12-char keys shows this to be slower than a hash (about 58% slower by rate in the run below, or roughly 2.4 times the run time), but it requires only about 20% of the memory:

#! perl -sw
use strict;

package OO::Index;

sub new {
    my $self;
    open RAM, '>', \$self;
    seek RAM, 2**28, 0;
    print RAM chr(0);
    close RAM;
    return bless \$self, $_[0];
}

sub get {
    my( $self, $idx ) = @_;
    return unpack 'V', substr $$self, $idx*4, 4;
}

sub set {
    my( $self, $idx, $value ) = @_;
    substr $$self, $idx*4, 4, pack 'V', $value;
    return $value;
}

return 1 if caller;
package main;
use Benchmark qw[ cmpthese ];
use List::Util qw[ reduce ]; $a = $b;

sub rndStr{ join'', @_[ map{ rand @_ } 1 .. shift ] }

sub xform{
    my $word = shift;
    $word =~ tr[ACGT][\x00\x01\x02\x03];
    reduce{ ($a << 2) + $b } 0, unpack 'C*', $word;
}


cmpthese 5, {
    hash => q[
        my %index;
        $index{ rndStr 12, qw[ A C G T ] }++ for 1 .. 1e6;
    ],
    oo => q[
        my $index = new OO::Index;
        for ( 1 .. 1e6 ) {
            my $i = xform rndStr 12, qw[ A C G T ];
            $index->set( $i, 1+ $index->get( $i ) );
        }
    ],
};

__END__
C:\test>junk
     s/iter   oo hash
oo     35.3   -- -58%
hash   15.0 136%   --
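If the percentage columns look odd: cmpthese compares rates (iterations per second), not elapsed times. A quick sketch that recomputes the comparison from the s/iter figures printed above; the small difference from the table's 136% is presumably down to rounding in the displayed timings.

```perl
use strict;
use warnings;

# s/iter figures copied from the benchmark output above.
my %secs = ( oo => 35.3, hash => 15.0 );

# Rate is the reciprocal of seconds per iteration.
my %rate = map { $_ => 1 / $secs{ $_ } } keys %secs;

# Relative rate differences, as in the cmpthese table.
printf "oo vs hash: %+.0f%%\n", 100 * ( $rate{oo}   / $rate{hash} - 1 );
printf "hash vs oo: %+.0f%%\n", 100 * ( $rate{hash} / $rate{oo}   - 1 );

# The same comparison expressed as elapsed time.
printf "oo takes %.2fx as long as hash\n", $secs{oo} / $secs{hash};
```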

HTH

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re: A better (problem-specific) perl hash? by BrowserUk
in thread A better (problem-specific) perl hash? by srdst13
