in reply to Comparing strings (exact matches) in LARGE numbers FAST

As indicated elsewhere, the quick and easy way is to use somebody else's carefully crafted sort program to sort both files and then run them against each other.

However, you did ask:

What approach would be really super-fast?

If the quick and easy approach is more or less fast enough, stop now. If you're not going to do this very often, stop now. Otherwise...

There are ~2 * 10^19 different 32-character strings. You have ~n * 10^7 of those, and ~n * 10^8 you want to match against. This suggests a fairly low hit rate. Yes?

If so, then we apply rule some-small-integer, and try to hack the problem down to size. Rather than attempt a lot of exact matching, I suggest the trick here is first to use some inexact matching, to eliminate strings for which there is definitely no match.

The obvious way to do that is to hash the strings into a largish integer and use a bit vector to record which hashes exist in the smaller file. I'd try a 30-bit hash, but 32 might work better (assuming a 512MB bit vector is not a shock to your system).
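To make the sizes concrete: a 30-bit hash needs 2^30 bits = 128MB, and a 32-bit hash needs 2^32 bits = 512MB. Here's a minimal bit-vector sketch (my illustration, in Python for brevity; Perl's vec() would do the same job):

```python
class BitVector:
    """Fixed-size bit vector backed by a bytearray."""

    def __init__(self, bits):
        # One byte holds 8 bits; round up to cover the full range.
        self.data = bytearray((bits + 7) // 8)

    def set(self, i):
        # Set bit i: byte index is i/8, bit within the byte is i%8.
        self.data[i >> 3] |= 1 << (i & 7)

    def test(self, i):
        # True if bit i has been set.
        return bool(self.data[i >> 3] & (1 << (i & 7)))
```

A BitVector(2 ** 30) allocates the 128MB table for a 30-bit hash in one go.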

The process is then:

  1. scan the smaller file and populate the bit-vector according to the hash values.
  2. scan the larger file and record any strings whose hash values appear in the bit-vector.
  3. with any luck, there are now a small number of candidate strings -- small enough, that is, to be held in memory, in particular in a hash table.
  4. scan the small file again, and see if anything is found in the hash table.
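Steps (1) to (4) might be sketched like this (a hedged illustration in Python, using CRC-32 truncated to 30 bits as the hash, per the final thought below; the function and names are mine, not from anywhere above):

```python
import zlib

HASH_BITS = 30
MASK = (1 << HASH_BITS) - 1

def h(s):
    # CRC-32 of the string, truncated to HASH_BITS bits.
    return zlib.crc32(s) & MASK

def find_matches(small_lines, large_lines):
    # Step 1: populate a 2**HASH_BITS-bit vector from the smaller file.
    seen = bytearray(1 << (HASH_BITS - 3))
    for s in small_lines:
        i = h(s)
        seen[i >> 3] |= 1 << (i & 7)

    # Step 2: scan the larger file, keeping only strings whose hash
    # appears in the bit vector (true matches plus some false positives).
    # Step 3: the survivors go into an in-memory hash table (a set here).
    candidates = set()
    for s in large_lines:
        i = h(s)
        if seen[i >> 3] & (1 << (i & 7)):
            candidates.add(s)

    # Step 4: rescan the small file for exact matches against the table.
    return [s for s in small_lines if s in candidates]
```

In real use the two loops would read the files line by line rather than take in-memory lists, so only the bit vector and the (hopefully small) candidate table need to fit in RAM.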

If both files are going to be different each time, we're done. Otherwise, you could build and save the bit-vector for the file that doesn't change, and eliminate step (1) in the above process (if the larger file is the unchanging one, the process needs to be adjusted to suit).

Health warning: If the strings are generally pretty random (in a hand waving sort of a way), you'll probably be OK with a straightforward hash function. If there's any regularity, finding the right hash may be tricky, and one hash may not suit a different set of data with different properties. Anyway, be careful with the hash function and try before you buy !

Final thought: if there's a really fast CRC-32 module somewhere, I'd try that for the hash. Oh, final final thought: I'd instrument the code so that you can tell how well the hash etc. are working.
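For the instrumentation, one cheap number to watch is the bit-vector fill ratio: with a reasonably random hash it approximates the false-positive rate in step (2). For instance, ~10^7 distinct strings hashed into 2^30 bits set roughly 1% of the bits, so roughly 1% of the non-matching big-file strings would survive as false candidates. A small helper (mine, purely illustrative):

```python
def fill_ratio(seen):
    # Fraction of bits set in the bit vector (a bytearray).
    # Approximates the probability that a random non-matching
    # string slips through the step (2) filter.
    set_bits = sum(bin(b).count("1") for b in seen)
    return set_bits / (len(seen) * 8)
```

If the ratio comes out much higher than number-of-strings / 2^hash-bits, the hash is clumping and needs another look.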


Re^2: Comparing strings (exact matches) in LARGE numbers FAST
by perlSD (Novice) on Aug 29, 2008 at 20:14 UTC
    Thanks -- the idea of using inexact matches first to reduce the number of exact comparisons needed will definitely save time.
    The larger file will actually be the same from one run to the next, so yes, I've thought of storing either the sorted file (if I go with a Unix-like approach such as comm or uniq) or the bit vector for that file, so that work is done only once and saved for subsequent runs.
    The strings are pretty random.