in reply to Matching data between huge files

How big is really big? I see no problem with loading a million keys into memory. For larger data sets, I've used a tied hash such as SDBM_File to store the keys on disk. A possibly even faster approach is to sort both files by key, or retrieve them already sorted. Then you can do away with the hash altogether and use a variation of a merge sort: advance through both files line by line, keeping or discarding lines whose keys match.
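A minimal sketch of that merge-style scan, assuming both inputs are already sorted by their first whitespace-separated field (the function name and field layout are mine, for illustration):

```perl
use strict;
use warnings;

# Walk two sorted filehandles in lockstep and return the lines
# from the first handle whose key also appears in the second.
# Keys are compared as strings, matching a plain `sort` on the files.
sub merge_matches {
    my ($fh1, $fh2) = @_;
    my @out;
    my $next = sub {
        my $fh = shift;
        defined(my $line = <$fh>) or return;
        my ($key) = split ' ', $line, 2;
        return ($key, $line);
    };
    my ($k1, $l1) = $next->($fh1);
    my ($k2, $l2) = $next->($fh2);
    while (defined $k1 and defined $k2) {
        if    ($k1 lt $k2) { ($k1, $l1) = $next->($fh1) }
        elsif ($k1 gt $k2) { ($k2, $l2) = $next->($fh2) }
        else {
            push @out, $l1;          # keys match: keep this line
            ($k1, $l1) = $next->($fh1);
            ($k2, $l2) = $next->($fh2);
        }
    }
    return @out;
}
```

Since it only ever holds one line from each file, memory use stays constant no matter how big the files get.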

Personally, I would first try the "slurp one file into a hash" approach, because it's the simplest. Even if you go slightly over the amount of physical memory in the box, the slowdown from swapping is unlikely to outweigh the effort of avoiding it. And it's always easy to modify such a program to use a tied hash on disk instead.
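The slurp approach can be sketched like this, again assuming the key is the first whitespace-separated field (the function name is illustrative):

```perl
use strict;
use warnings;

# Load every key from the first handle into a hash, then stream
# the second handle and keep only the lines whose key was seen.
sub hash_matches {
    my ($key_fh, $data_fh) = @_;
    my %seen;
    while (<$key_fh>) {
        my ($key) = split ' ', $_, 2;
        $seen{$key} = 1;
    }
    my @out;
    while (<$data_fh>) {
        my ($key) = split ' ', $_, 2;
        push @out, $_ if exists $seen{$key};
    }
    return @out;
}
```

Moving the hash to disk later is a one-line change: tie %seen to SDBM_File (via Fcntl's O_RDWR|O_CREAT flags) before filling it, and the rest of the program stays the same.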