As indicated elsewhere, the quick and easy way is to use somebody else's carefully crafted sort program to sort both files and then run them against each other.

However, you did ask:

What approach would be really super-fast?

If the quick and easy approach is more or less fast enough, stop now. If you're not going to do this very often, stop now. Otherwise...

There are ~2 * 10^19 different 32 character strings. You have ~n * 10^7 of those and ~n * 10^8 you want to match against. This suggests a fairly low hit rate. Yes ?

If so, then we apply rule some-small-integer, and try to hack the problem down to size. Rather than attempt a lot of exact matching, I suggest the trick here is first to use some inexact matching, to eliminate strings for which there is definitely no match.

The obvious way to do that is to hash the strings into a largish integer and use a bit vector to record which hashes exist in the smaller file. I'd try a 30-bit hash, but 32 might work better (assuming a 512MB bit vector is not a shock to your system).

The process is then:

  1. scan the smaller file and populate the bit-vector according to the hash values.
  2. scan the larger file and record any strings whose hash values appear in the bit-vector.
  3. with any luck, there are now a small number of candidate strings, small enoough that is to be held in memory -- in particular in a hash table.
  4. scan the small file again, and see if anything is found in the hash table.

If both files are going to be different each time, we're done. Otherwise, you could build and save the bit-vector for the file that doesn't change, and eliminate step (1) in the above process (if the larger file is the unchanging one, the process needs to be adjusted to suit).

Health warning: If the strings are generally pretty random (in a hand waving sort of a way), you'll probably be OK with a straightforward hash function. If there's any regularity, finding the right hash may be tricky, and one hash may not suit a different set of data with different properties. Anyway, be careful with the hash function and try before you buy !

Final thought, if there's a really fast CRC-32 module somewhere, I'd try that for the hash. Oh, final final though, I'd instrument the code so that you can tell how well the hash etc. are working.


In reply to Re: Comparing strings (exact matches) in LARGE numbers FAST by gone2015
in thread Comparing strings (exact matches) in LARGE numbers FAST by perlSD

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.