The basic idea: pack 2 bits per character into a number, so each of your 12-character words is represented by an integer in 0 .. 2**24 - 1. Use that integer to index into an array of packed integers, each requiring 4 bytes. Strictly, 2**24 slots of 4 bytes need only 2**26 bytes (64 MB); the code below allocates 2**28 bytes (256 MB), which more than covers your 12-byte words.

Transforming the words into integers can be done like this (the quickest of 4 methods I tried):

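use List::Util qw[ reduce ];

sub xform{
    my $word = shift;
    # Map each base to its 2-bit value: A=0, C=1, G=2, T=3
    $word =~ tr[ACGT][\x00\x01\x02\x03];
    # Fold the values together, 2 bits per character
    reduce{ ($a << 2) + $b } 0, unpack 'C*', $word;
}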
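To make the mapping concrete, here is a minimal sketch that runs xform over a few words. The unxform helper is hypothetical (it is not one of the 4 methods I tried) and is only there to show that the packing is reversible when the word length is known.

```perl
use strict;
use warnings;
use List::Util qw[ reduce ];

# Same transformation as above: A,C,G,T become 0..3, folded 2 bits at a time.
sub xform {
    my $word = shift;
    $word =~ tr[ACGT][\x00\x01\x02\x03];
    return reduce { ( $a << 2 ) + $b } 0, unpack 'C*', $word;
}

# Hypothetical inverse: peel off 2 bits at a time, last character first.
sub unxform {
    my( $n, $len ) = @_;
    my $word = '';
    for ( 1 .. $len ) {
        $word = substr( 'ACGT', $n & 3, 1 ) . $word;
        $n >>= 2;
    }
    return $word;
}

for my $word ( qw[ AAAAAAAAAAAA AAAAAAAAAAAC TTTTTTTTTTTT ] ) {
    my $n = xform( $word );
    printf "%s => %d => %s\n", $word, $n, unxform( $n, length $word );
}
```

The all-A word maps to 0 and the all-T word to 2**24 - 1, so every 12-character word gets its own slot.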

Then you need an implementation of a packed integer array. I tried a tied array, which has the convenience of standard array notation, $index[ xform( $word ) ]++;, but it proved to be rather slow. Converting that to an OO representation was about 20% faster:

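package OO::Index;

sub new {
    my $self;
    # Preallocate the buffer by writing one byte at offset 2**28
    # through an in-memory filehandle
    open RAM, '>', \$self;
    seek RAM, 2**28, 0;
    print RAM chr(0);
    close RAM;
    return bless \$self, $_[0];
}

# Read the 4-byte little-endian integer in slot $idx
sub get {
    my( $self, $idx ) = @_;
    return unpack 'V', substr $$self, $idx*4, 4;
}

# Overwrite slot $idx with $value
sub set {
    my( $self, $idx, $value ) = @_;
    substr $$self, $idx*4, 4, pack 'V', $value;
    return $value;
}

1;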
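As a usage illustration, here is a cut-down sketch of the same get/set idea. The package name OO::Index::Sketch, the $slots constructor argument, and the inc method are all assumptions made for the example. It also swaps the in-memory filehandle trick for plain string repetition, so that a small buffer can be tried without allocating 256 MB.

```perl
use strict;
use warnings;

package OO::Index::Sketch;    # hypothetical name, kept apart from OO::Index

# $slots is an assumed parameter; OO::Index above hard-codes its size.
sub new {
    my( $class, $slots ) = @_;
    my $buf = "\0" x ( $slots * 4 );    # 4 zeroed bytes per counter
    return bless \$buf, $class;
}

# Read the 4-byte little-endian counter in slot $idx.
sub get {
    my( $self, $idx ) = @_;
    return unpack( 'V', substr( $$self, $idx * 4, 4 ) );
}

# Hypothetical convenience method: read, add one, write back.
sub inc {
    my( $self, $idx ) = @_;
    my $value = 1 + $self->get( $idx );
    substr( $$self, $idx * 4, 4, pack( 'V', $value ) );
    return $value;
}

package main;

my $index = OO::Index::Sketch->new( 4**3 );    # slots for 3-character words
$index->inc( 5 ) for 1 .. 2;
$index->inc( 63 );

printf "slot %2d holds %d\n", $_, $index->get( $_ ) for 0, 5, 63;
```

Each counter is an unsigned 32-bit value, so it wraps after 2**32 - 1 increments; that is assumed to be ample for word counts.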

A crude benchmark over 1e6 random 12-char keys shows this running at about 40% of the speed of a hash (35.3 s against 15.0 s per iteration), but requiring only 20% of the memory:

#! perl -sw
use strict;

package OO::Index;

sub new {
    my $self;
    open RAM, '>', \$self;
    seek RAM, 2**28, 0;
    print RAM chr(0);
    close RAM;
    return bless \$self, $_[0];
}

sub get {
    my( $self, $idx ) = @_;
    return unpack 'V', substr $$self, $idx*4, 4;
}

sub set {
    my( $self, $idx, $value ) = @_;
    substr $$self, $idx*4, 4, pack 'V', $value;
    return $value;
}

return 1 if caller;

package main;
use Benchmark qw[ cmpthese ];
use List::Util qw[ reduce ];
$a = $b;    # silence "used only once" warnings for reduce's $a and $b

# Build a random string of N characters drawn from the remaining arguments
sub rndStr{ join'', @_[ map{ rand @_ } 1 .. shift ] }

sub xform{
    my $word = shift;
    $word =~ tr[ACGT][\x00\x01\x02\x03];
    reduce{ ($a << 2) + $b } 0, unpack 'C*', $word;
}

cmpthese 5, {
    hash => q[
        my %index;
        $index{ rndStr 12, qw[ A C G T ] }++ for 1 .. 1e6;
    ],
    oo => q[
        my $index = new OO::Index;
        for ( 1 .. 1e6 ) {
            my $i = xform rndStr 12, qw[ A C G T ];
            $index->set( $i, 1+ $index->get( $i ) );
        }
    ],
};

__END__
C:\test>junk
      s/iter   oo hash
oo      35.3   -- -58%
hash    15.0 136%   --
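One variation that might be worth adding to the benchmark is Perl's built-in vec, which treats a string as an array of fixed-width integers and can be incremented in place. This is only a sketch under the assumption that byte order does not matter to you: vec stores 32-bit elements big-endian, unlike the 'V' format used above. It is not part of the timings shown, so whether it is faster would need measuring.

```perl
use strict;
use warnings;

my $slots = 4**3;                      # small demo size: 3-character words
my $buf   = "\0" x ( $slots * 4 );     # 4 zeroed bytes per counter

# vec( STRING, OFFSET, BITS ) is an lvalue, so ++ works directly on a slot.
vec( $buf, 5,  32 )++ for 1 .. 3;
vec( $buf, 63, 32 )++;

printf "slot %2d holds %d\n", $_, vec( $buf, $_, 32 ) for 0, 5, 63;
```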

HTH


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re: A better (problem-specific) perl hash? by BrowserUk
in thread A better (problem-specific) perl hash? by srdst13
