I have an application where I need to be able to map some strings to some integers and then be able to lookup the integers from their strings or vice versa. (NOTE: both strings and integers are unique!)

In Perl this can be (and currently is) achieved by using two hashes:

my %intByStr = populate(); my %strByInts; $strByInt{ intByStr{ $_ } } = $_ for keys %intByStr; ...

But the penalty for this is that every key and value is stored twice. Which means for a fairly modest dataset of just 450,000 key/value pairs, the total storage requirements are 95MB:

$i=1; $intByStr{ $_ } = $i++ for 'aaaa'..'zzzz';; $strByInt{ intByStr{ $_ } } = $_ for keys %intByStr;; print total_size( $_ ) for \(%intByStr, %strByInt);; 50805928 44297159

As my dataset can be somewhat larger, I'd like to minimise the space requirement. A modest improvement can be achieved by storing both mappings in a single hash:

undef %AbyB; $i=1; $AbyB{ $_ } = $i, $AbyB{ $i++ } = $_ for 'aaaa'..' +zzzz';; print total_size( \%AbyB );; 80479783

The potential space requirement for my dataset, which might grow to 10x (or 20x or more) the size of the examples above, is such that I'm looking for an alternative, even if it means coding the solution in (Inline::)C, but simply using Perl's hashes from C won't gain me anything.

I could use a more space efficient hash implementation, which I found a few of on the net; that usually come with a penalty of less flexibility which I could live with for this purpose.

Somewhere in the back of my mind I seem to recall a bidirectional lookup mechanism/algorithm that didn't require the duplicate storage, but for the life of me I cannot remember anything about how it worked; and my attempts to search for "bidirectional search algorithm" all turn up variations on this which is a shortest path between two node of a graph algorithm and thus not useful for me.

To the question:

Does anyone out there know/remember/have any pointers to "bidirection lookup algorithm" that only stores the values once?

Update per /msg request for further info:

The strings are usually fairly short, 1 to 8 characters, but sometimes can extend further upto 32 or occasionally more. (Think: typical variable names!)

The integers are (unique) 64-bit values. Think memory addresses for 64-bit processors.

The data is loaded once. The two lookup mechanisms (hashes/arrays/indexes/whatever) would be built once as, or after, the data is read; so build/insert speed is not a great concern.

The vast majority of the uses will be lookups, many millions, so the priority is the lookup speed -- preferably O(1) though O(logN) might be acceptable if the memory reduction was sufficient.

The data is discarded at the end of each run. There are no deletes.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Bidirectional lookup algorithm? (Updated: further info.) by BrowserUk

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.