in reply to Searching Huge files

You have to put one of the two files into a hash; it doesn't really matter which one. Since the hash won't fit into memory, it must be put on disk, either into a real database like MySQL or, better in this case, into a DBM::Deep database.

Then just loop through the other file and look up the ids in the hash. That should literally be a millionfold speedup.
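
A minimal sketch of that approach, assuming the id is the first whitespace-separated field on each line; the file names (file1.txt, file2.txt, ids.db) are just placeholders:

    use strict;
    use warnings;
    use DBM::Deep;

    # Build a disk-based hash keyed on the ids from the first file.
    my $db = DBM::Deep->new("ids.db");

    open my $fh1, '<', 'file1.txt' or die "file1.txt: $!";
    while (my $line = <$fh1>) {
        my ($id) = split ' ', $line;   # assumes the id is the first field
        $db->{$id} = 1 if defined $id;
    }
    close $fh1;

    # Stream the other file and look each id up in the on-disk hash.
    open my $fh2, '<', 'file2.txt' or die "file2.txt: $!";
    while (my $line = <$fh2>) {
        my ($id) = split ' ', $line;
        print $line if defined $id && exists $db->{$id};   # id appears in both files
    }
    close $fh2;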

Re^2: Searching Huge files
by graff (Chancellor) on Jul 08, 2008 at 03:35 UTC
    You have to put one of the two files into a hash; it doesn't really matter which one.

    Actually, there's a good chance that it does matter. If one file has about 2 million rows/keys and the other has about 8 million, it will take noticeably fewer resources and less time to store the keys of the smaller file into a hash. As GrandFather suggested above, there's a reasonable chance that a hash of 2 million elements could fit into RAM without causing the machine to flail as virtual memory is bounced back and forth between RAM and the swap file.

    But whether it's in memory or in a DBM file of some sort, creating 2 million keys will be quicker than creating 8 million (and it just seems to make more sense). Of course, once a hash has been built, access time is not likely to differ all that much (except when an "in-memory" hash is big enough to induce swapping), but the time and space needed to build the hash may differ significantly depending on the number of elements involved.
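
    A minimal sketch of that in-memory variant, keying on the smaller file; the file names and the assumption that the id is the first whitespace-separated field are placeholders:

        use strict;
        use warnings;

        # Hash the keys of the smaller file (~2 million rows) in RAM.
        my %seen;
        open my $small, '<', 'small.txt' or die "small.txt: $!";
        while (my $line = <$small>) {
            my ($id) = split ' ', $line;   # assumes the id is the first field
            $seen{$id} = 1 if defined $id;
        }
        close $small;

        # Stream the larger file (~8 million rows) and test membership.
        open my $big, '<', 'big.txt' or die "big.txt: $!";
        while (my $line = <$big>) {
            my ($id) = split ' ', $line;
            print $line if defined $id && exists $seen{$id};
        }
        close $big;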