in reply to Algorithm advice sought for seaching through GB's of text (email) files

Instead of searching the 2 GB once for each of the 15,000 email addresses, turn the problem around. Read through the 2 GB once, pick out the email address of interest from each file, and test whether it is one of the 15,000 by looking it up in a hash.

Loading 15,000 email addresses into a hash takes ~1 MB.
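As a rough sketch of that loading step, here it is in Python, where a set plays the role of the hash. The one-address-per-line input and the lowercasing are my assumptions, not something given in the original question, and the memory footprint will differ from the ~1 MB figure depending on the language.

```python
def load_addresses(lines):
    """Build a set of normalized addresses from an iterable of text lines.

    Assumes one address per line (hypothetical input format). Addresses are
    lowercased so that lookups are case-insensitive.
    """
    wanted = set()
    for line in lines:
        addr = line.strip().lower()
        if addr:                      # skip blank lines
            wanted.add(addr)
    return wanted

# Typical use (file name is hypothetical):
#   with open("addresses.txt") as f:
#       wanted = load_addresses(f)
```

Membership tests against the set are then constant time on average, which is what makes the single pass over the mail files cheap.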

Processing 1 or 2 GB line by line, picking out the appropriate header line, extracting the email address and checking for its existence in the hash shouldn't take more than 3 or 4 minutes. Maybe less, as the header line you are looking for should be near the top of each file, so you can skip reading most of each file.
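The single pass might look something like the following Python sketch. It assumes one message per file, that the header of interest is `From:` (swap in whichever header you actually need), and that the header is not folded across lines. The function names are made up for illustration. Reading stops at the first blank line, which marks the end of the headers, so the body of each message is never read.

```python
import os
from email.utils import parseaddr

def header_address(path, header="from"):
    """Return the lowercased address from the first matching header line,
    or None if the header block ends without one."""
    prefix = header.lower() + ":"
    # latin-1 maps every byte to a character, so odd encodings cannot raise
    with open(path, "r", encoding="latin-1") as f:
        for line in f:
            if line.strip() == "":        # blank line: end of headers, stop
                return None
            if line.lower().startswith(prefix):
                _name, addr = parseaddr(line[len(prefix):].strip())
                return addr.lower() or None
    return None

def matching_files(root, wanted, header="from"):
    """Walk root once and yield (path, address) for every file whose
    header address is in the set `wanted`."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            addr = header_address(path, header)
            if addr in wanted:
                yield path, addr
```

Each file is opened once and abandoned as soon as the header is found or the headers end, so the cost is dominated by the number of files rather than their total size.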

