As indicated elsewhere, the quick and easy way is to use somebody else's carefully crafted sort program to sort both files and then run them against each other.
However, you did ask:
What approach would be really super-fast?
If the quick and easy approach is more or less fast enough, stop now. If you're not going to do this very often, stop now. Otherwise...
There are ~2 * 10^19 different 32 character strings. You have ~n * 10^7 of those and ~n * 10^8 you want to match against. This suggests a fairly low hit rate. Yes ?
If so, then we apply rule some-small-integer, and try to hack the problem down to size. Rather than attempt a lot of exact matching, I suggest the trick here is first to use some inexact matching, to eliminate strings for which there is definitely no match.
The obvious way to do that is to hash the strings into a largish integer and use a bit vector to record which hashes exist in the smaller file. I'd try a 30-bit hash, but 32 might work better (assuming a 512MB bit vector is not a shock to your system).
The process is then:
If both files are going to be different each time, we're done. Otherwise, you could build and save the bit-vector for the file that doesn't change, and eliminate step (1) in the above process (if the larger file is the unchanging one, the process needs to be adjusted to suit).
Health warning: If the strings are generally pretty random (in a hand waving sort of a way), you'll probably be OK with a straightforward hash function. If there's any regularity, finding the right hash may be tricky, and one hash may not suit a different set of data with different properties. Anyway, be careful with the hash function and try before you buy !
Final thought, if there's a really fast CRC-32 module somewhere, I'd try that for the hash. Oh, final final though, I'd instrument the code so that you can tell how well the hash etc. are working.
In reply to Re: Comparing strings (exact matches) in LARGE numbers FAST
by gone2015
in thread Comparing strings (exact matches) in LARGE numbers FAST
by perlSD
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |