in reply to Algorithm advice sought for seaching through GB's of text (email) files
Instead of searching the 2 GB for each of the 15,000 email addresses, turn the problem around. Read through the 2 GB once, pick out the email address of interest from each file and test if it is one of the 15,000 by looking it up in a hash.
Loading 15,000 email addresses into a hash takes ~1 MB.
Processing 1 or 2 GB line by line, picking out the appropriate header line, extracting the email address and checking for its existence in the hash shouldn't take more than 3 or 4 minutes. Maybe less, as the header line you are looking for should be near the top of each file, so you can skip reading most of each file.
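To make the idea concrete, here is a minimal sketch in Python, where a set plays the role of the hash. It rests on several assumptions that are not in the original question: one message per file, the address of interest is in the From: header, and the 15,000 addresses sit one per line in a text file. The function names are made up for illustration, and folded (multi-line) headers are ignored.

```python
import os
from email.utils import parseaddr


def load_addresses(path):
    """Read one address per line into a set (the 'hash' of wanted addresses)."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return {line.strip().lower() for line in fh if line.strip()}


def header_address(path, header="from:"):
    """Return the address on the first matching header line, or None.

    Stops at the first blank line (end of headers), so the body of the
    message is never read.
    """
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if not line.strip():
                return None  # headers are over, no match found
            if line.lower().startswith(header):
                addr = parseaddr(line[len(header):].strip())[1]
                return addr.lower() or None
    return None


def matching_files(root, wanted):
    """Walk the mail directory once, yielding (path, address) for each hit."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            addr = header_address(path)
            if addr in wanted:  # constant-time lookup in the set
                yield path, addr
```

Usage would look something like `wanted = load_addresses("addresses.txt")` followed by `for path, addr in matching_files("maildir", wanted): print(path, addr)`, with both paths being placeholders. Each file is opened once and abandoned as soon as the header is found or the headers end, which is where the time saving comes from.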