Okay. I have my first pass solution completed, and here (as promised) is the code (in the form of an Inline::C script for RAD and testing), and a brief description.

Description:

The code presents a class: BiMap; which supports the following methods:

The basic structure of the BiMap contains pointers to two parallel arrays: byInt[] of PAIR structures; byStr[] of U32 integers;

typedef struct { U64 addr; char *name; } PAIR; typedef struct { PAIR **byInt; U32 *byStr; U32 size; U32 used; double factor; } BIMAP;

The PAIRs are inserted into byInt[] hashed by the (U64)addr in the PAIR.

The byStr[] holds indexes (+1 to allow 0 to represent empty), of the PAIRs, hashed by the (char*)name in the PAIR. Using indexes into byInt[] rather than address of the pairs themselves saves half the space at the cost of an extra indirection.

As a large majority of the names will be less than 8 characters; when that is the case the char* name component of the PAIR is used to hold the string directly, rather than a pointer to it, thus saving another 8-bytes per compliant PAIR. The high bit of the 8th byte is set to distinguish between directly stored strings and addresses of strings.

I tested several combinations of: a) various hash functions; and b) various probing algorithms; and in my (simplistic) testing; no other combination beat the performance of the (Jenkins) one-at-a-time hash function erstwhile used by Perl; and the very simple, +1 probing mechanism.

Results:

Using a single, combined Perl hash to look up 11 million strings from their associated 64-bit integer address, and vice versa, takes a total of 15.9 seconds and requires 4.17 GB:

C:\test>1112165-hash Start: mem: 7,920 K 23762752 Hash built: mem: 4,382,828 K Fetch by Name took 7.539046 seconds Fetch by Addr took 8.349479 seconds Hash queried: check mem: 4,382,852 K

Using BiMap on the same dataset using a table presized to accomodate 16 million pairs, and a 0.71 fill factor. takes 22.3 seconds and requires just over 1/2GB:

C:\test>BiMap -S=5 -F=0.71 -I=12000000 -D=0 bless(do{\(my $o = 52848976)}, "BiMap") Fetch by Addr & Fetch by Name took 22.257803 seconds for 11881376 item +s in a 16777216 sized BiMap Mem: 580,552 K

By moving to a fill factor of 0.70 and pre-sizing the arrays to accommodate 33 million pairs, takes 19.8 seconds and requires just over 3/4GB:

C:\test>BiMap -S=5 -F=0.70 -I=20000000 -D=0 bless(do{\(my $o = 52848976)}, "BiMap") Fetch by Addr & Fetch by Name took 19.896479 seconds for 11881376 item +s in a 33554432 sized BiMap Mem: 777,532 K

Conclusion:

Using a BiMap in preference to a Perl hash will allow 8 times as much data to be processed per run; thus both reducing the number of runs required whilst increasing the statistical validity of the results by well over 8 times.

Size for size, the total runtime of the tests will increase by approximately 40%. A fair penalty for the substantial increase in statistical validity.

Other options:

The code:


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

In reply to Re: Bidirectional lookup algorithm? (Solution.) by BrowserUk
in thread Bidirectional lookup algorithm? (Updated: further info.) by BrowserUk

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.