Essentially then, you have a sparse matrix of booleans. With a density of 1/500, bitvectors will not produce the most compact form.

However, the most compact form is not of essence here. When you search arbitrary data in multiple dimensions or by multiple keys, you'll need to index by each key. In effect you create a number of copies of the data. Here the index (address) of a bit is many times bigger than that one bit itself.

Now, here's the solution I'm thinking of. Make an AoA of the vectors X, bucketed by bit ID. That is, 15k buckets each containing estimated 200 ID's of vectors X. Make an array N[X] that counts hits per vector X. These structures you can keep in memory.

When processing, preallocate arrays and push the X's. Then, for each vector Y, store the hits: store($x, $N[$x]++, $y); With ca 30*200 hits per vector, total hits are 5M*6k == 30G. Say 3* the expected hits, 4 bytes per id, this comes to 3*4*30G == 360 GB. The store just writes integer y at offset 3.6e6*x + 4*N.

Finding best matches is a linear scan of that 360GB file. Each 3.6MB segment corresponds to all hits of vector X. Sort(?), count, keep the best. With a big cluster, you could partition the whole processing and run it all in memory. Anyway, use a mmap'ed file.

Update: Modern CPU's may have 512 DTLB entries or so. It is probably best if you process no more than ~500 X-vectors at a time (per thread). This amounts to 200 passes over your Y-vectors. Hence, compacting the Y ought to be the first step. Again, a cluster of 200+ cores might come handy. Actually, this won't be a bottleneck (hm?).

Update2. Reconsidering the above, I realise intermediate storage is unnecessary. However, if you populate 15k buckets with all Y vectors (this time), there will be approx 10k entries per bucket, totaling 4*10k*15k = 600M bytes. Sort each bucket. Matching an X vector then involves scanning ~30 ways *10k entries in a manner that is very similar to merge sort. This should take less than a second... (Inline C)


In reply to Re: Comparing two arrays by oiskuu
in thread Comparing two arrays by baxy77bax

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.