chargrill has asked for the wisdom of the Perl Monks concerning the following question:

Greetings:

I've done some searching, but the search keywords I've thought of thus far haven't really given me much insight. I haven't written any code, nor am I looking for anyone to do so - instead, I'm asking for advice on an approach.

The situation:

  1. My company runs a few mailing lists. Opt-in, of course. Rather large.
  2. I've got 15,000 email addresses that my company has sent a test email to. We're trying to figure out why we have recently seen a statistically significant number of "hard bounces" - bounce messages that trigger our mail processor to auto-unsubscribe the list member. (We believe the numbers we're seeing are due to some auto-filter by ISPs and/or AOL, and we want to be sure the bounces are valid. Or not.)
  3. I have a machine that collects all inbound email, whether it be bounce messages, "unsubscribe" requests, or spam sent to the inbound mailbox.
  4. At last count, I had about 500,000 emails (full headers, sorted into a few directories of perhaps 200k-300k messages each, based on some programmatic decision made by those who came before me, be they "bounces" or "failures" or "unknown") that might contain a bounce or other message (one that perhaps our mail sorting program didn't know how to handle appropriately) for that list of 15k email addresses.

What I'm looking for is an idea of how to sift through what's likely to be about 300,000 email messages at a time (which equates to roughly a GB or two of text data), find the messages that were originally intended for those 15,000 email addresses, figure out what the bounce message is, and handle it appropriately. Or figure out why we're not handling it appropriately. Of course, handling it appropriately means removing the address from our database, marking it as a "soft bounce" (mailbox is full, etc.), ignoring it, adding a "does not accept HTML email" flag to our database, and so on. But that part is not what I need help with :-)

So I'm looking for a memory-efficient algorithm/approach (hopefully not too slow, though somewhat slow I can live with) for handling this - I honestly can't think of a good way to do it other than writing a thin wrapper around 15,000 greps, or some other resource-intensive brute-force method.

Any and all input welcomed!



--chargrill
s**lil*; $*=join'',sort split q**; s;.*;grr; &&s+(.(.)).+$2$1+; $; = qq-$_-;s,.*,ahc,;$,.=chop for split q,,,reverse;print for($,,$;,$*,$/)

Re: Algorithm advice sought for searching through GBs of text (email) files
by BrowserUk (Patriarch) on Sep 24, 2006 at 06:14 UTC

    Instead of searching the 2 GB for each of the 15,000 email addresses, turn the problem around. Read through the 2 GB once, pick out the email address of interest from each file and test if it is one of the 15,000 by looking it up in a hash.

    Loading 15,000 email addresses into a hash takes ~1 MB.

    Processing 1 or 2 GB line by line, picking out the appropriate header line, extracting the email address, and checking for its existence in the hash shouldn't take more than 3 or 4 minutes. Maybe less, as the header line you are looking for should be near the top of each file, so you can skip reading most of each file.
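
    A minimal sketch of that approach (the address list file and the header names are assumptions - which header carries the original recipient depends on your MTA, so X-Failed-Recipients/Original-Recipient below are purely illustrative):

        #!/usr/bin/perl -w
        use strict;
        use File::Find;

        # Load the 15,000 addresses of interest into a hash
        # (one address per line in addresses.txt - filename is illustrative).
        my %wanted;
        open my $list, '<', 'addresses.txt' or die "addresses.txt: $!";
        while (<$list>) {
            chomp;
            $wanted{ lc $_ } = 1;
        }
        close $list;

        # Walk the bounce directories given on the command line, read only the
        # header block of each message, and report files whose candidate
        # address is on the list.
        find(sub {
            return unless -f;
            open my $msg, '<', $_ or return;
            while (my $line = <$msg>) {
                last if $line =~ /^\s*$/;    # blank line ends the headers
                next unless $line =~ /^(?:X-Failed-Recipients|Original-Recipient|To):/i;
                while ($line =~ /([\w.+-]+\@[\w.-]+)/g) {
                    print "$File::Find::name\t$1\n" if $wanted{ lc $1 };
                }
            }
            close $msg;
        }, @ARGV);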


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Algorithm advice sought for searching through GBs of text (email) files
by tilly (Archbishop) on Sep 24, 2006 at 06:14 UTC
    The number of items in your data sets isn't the problem. You have 15,000 and 500,000, which are small enough to process easily with a database. The real issue is that one data set is highly unstructured.

    There are a number of ways to skin this cat. But the key is that you're going to need to write a routine that takes an email message and produces a list of possible original recipients from that message. This step should be generous - you don't want to miss any.
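
    A minimal sketch of such a routine, under the generous-is-fine assumption above - here it simply scrapes every address-shaped token out of the raw message, which you could later narrow to particular headers or the delivery-status part:

        use strict;

        # Hypothetical "generous" extractor: return every address-looking token
        # found anywhere in the message, lowercased and deduplicated.
        sub get_possible_original_recipients {
            my ($file) = @_;
            open my $fh, '<', $file or die "$file: $!";
            my $msg = do { local $/; <$fh> };    # slurp the whole message
            close $fh;

            my %seen;
            $seen{ lc $1 } = 1 while $msg =~ /([\w.+-]+\@[\w.-]+\.\w+)/g;
            return keys %seen;
        }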

    Now pick your method. One approach is to create a hash whose keys are the 15,000 test email addresses and follow this pseudocode:

    while (my $file = get_email_file()) {
        for my $email (get_possible_original_recipients($file)) {
            if ($in_test{$email}) {
                # Do something here.
            }
        }
    }
    A second approach is to create a table in a database with the fields (filename, email_in_file), populate another table with your test emails, then join them.
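
    A rough sketch of that join via DBI (the DSN and credentials are placeholders, and the table/column names - file_has_email and test_addresses - are only illustrative):

        #!/usr/bin/perl -w
        use strict;
        use DBI;

        # Placeholder connection details; adjust driver, database and credentials.
        my $dbh = DBI->connect('dbi:Pg:dbname=bounces', 'user', 'pass',
                               { RaiseError => 1, AutoCommit => 1 });

        # file_has_email(filename, email_in_file) is filled in by the scanning
        # jobs; test_addresses(email) holds the 15,000 addresses under study.
        my $sth = $dbh->prepare(q{
            SELECT f.filename, f.email_in_file
            FROM   file_has_email f
            JOIN   test_addresses t ON t.email = f.email_in_file
        });
        $sth->execute;
        while (my ($file, $email) = $sth->fetchrow_array) {
            print "$file\t$email\n";
        }
        $dbh->disconnect;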

    Personally I'd suggest following the second approach for several reasons.

    1. It is easily parallelizable - you can have multiple jobs populating that first table at once so you get through that big step faster.
    2. As you tweak the more complex later processing, it is faster to re-run. You can just execute a query rather than having to search gigabytes of emails every time you tweak it.
    3. This is generally useful information to have; you're likely to find other uses for it later.

      Why I wouldn't use a DB for this application (unless there is a known alternate and ongoing use of the data).

      1. There is no point in parallelizing the population of the DB.

        Reading from 300,000 files is going to be an IO-bound process. Doubly so, if you are then writing to a DB.

        By the time you have read the contents of the emails and isolated the appropriate address(es), the CPU/real time required for the lookup step in a hash is insignificant - hence IO-bound. Once the hash lookup is done, the process is complete. No need for any other "big step".

        The time taken for the lookup will also be insignificant compared to inserting those addresses into a DB. Besides the costs of communications with the DB, the DB will need to write the data to the filesystem, further competing for kernel IO time and slowing the readers.

        Additionally, there will be synchronisation costs associated with multiple writers to a single database. And that's all before you get to the point of needing to re-read all the data written to the filesystem in order to do that unnecessary "big step" (the join).

      2. If you take 2 GB of data from unstructured files and store it in a structured form in a database, it will require double, treble, or even quadruple the storage capacity of the original flat files, depending upon how much of the original data you decide to structure. More if you fully index it.

        Unless you decide to only store a subset of the data, in which case, the chances are you will not have all the information available for more complex processing.

        When it comes to "just executing a query" against that database later, you may save a little time if you have indexed the appropriate fields from the original data. But if any of your query relies upon wild-card lookups in the unstructured part of the email - the body text - using LIKE '%term%' comparisons, then you will still need to process the same volume of data from disc, and it will be much slower than using Perl's regex engine.

      3. But will it be quicker to write the program to find it?

        Writing my test app for this took 7 minutes, and a run took 20. How long did it take you to write your test app?


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        I was not talking about storing all of the email in the database. Just the email addresses found in the emails. This should be a few megabytes at most.

        About the I/O, I agree with you that this is going to be an I/O-bound job. However, I/O-bound jobs tend to be limited by the latency of waiting on the disk, which makes them ideal candidates for parallelization. (The disk spins at the same speed no matter how many readers are waiting for their sector to come under the head.) The additional I/O of writing to the database is insignificant compared to the reading overhead. And the locking logic - well, that is built into databases already.

        How long does it take to write, how long to run? Well, I didn't do that. Lemme try writing most of it now, untested, off the top of my head.

        #!/usr/bin/perl -w
        use strict;
        use Email::Find qw(find_emails);
        use DBI;
        use File::Slurp qw(slurp);

        my $dbh = DBI->connect(insert appropriate details here);

        my $sth = $dbh->prepare(qq(
            INSERT INTO file_has_email (filename, email) VALUES (?, ?)
        )) or die "Cannot prepare: " . $dbh->errstr;

        for my $file (@ARGV) {
            find_emails(scalar slurp($file), sub {
                my $email = shift;
                $sth->execute($file, $email->format)
                    or die "Cannot execute with ($file, " . $email->format . "): "
                           . $dbh->errstr;
                return shift;    # hand back the matched text unchanged
            });
        }

        $dbh->commit() or die "Cannot commit: " . $dbh->errstr;
        OK, I forgot to time myself, but that took 5-10 minutes. Based on past experience, I'd expect you'd be able to get full speed out of 5-10 copies of that running in parallel. Assuming similar per-file performance to your version, that means it would take 2-4 minutes per run.
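
        For the parallel copies, a minimal sketch of a driver (assuming the script above is saved as insert_emails.pl; the worker count and round-robin split are arbitrary):

            #!/usr/bin/perl -w
            use strict;

            my $workers = 8;                 # arbitrary; tune to your disks
            my @files   = @ARGV;

            # Round-robin the file list into one chunk per worker.
            my @chunks;
            push @{ $chunks[ $_ % $workers ] }, $files[$_] for 0 .. $#files;

            my @pids;
            for my $chunk (@chunks) {
                next unless $chunk && @$chunk;
                my $pid = fork();
                die "fork failed: $!" unless defined $pid;
                if ($pid == 0) {             # child: run one copy on its chunk
                    exec 'perl', 'insert_emails.pl', @$chunk
                        or die "exec failed: $!";
                }
                push @pids, $pid;            # parent: remember the child
            }
            waitpid $_, 0 for @pids;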

        So if things are set up already, the database approach probably does get you the answer quicker. If you have to set up a database from scratch, it doesn't.

        However, think about what happens as you develop the code that takes a closer look at those emails and tries to figure out why each was miscategorized and what to do about it. For that it is easy to have a small test list of, say, 50 emails and execute your code against that list. Those test runs will take on the order of seconds (how many depends on how complex your processing of the matches is), which gives you a quick development turnaround time. Then, when you are fairly confident in the output, you can put the full list of 15,000 in and do a full run. My bet is that the biggest investment of human time is going to be in developing that follow-up code, and a quick testing turnaround on that is very helpful.

        Update: Forgot the all-important commit. (Of course as soon as I ran it on a test file, I would have noticed. And in real life I start all my $work scripts with a standard template that already has the connect and commit built in.)

        You're worried about using more than 2 GB of hard drive space? Isn't the going rate about $1 per GB? Space doesn't seem like an issue here, with such a small set of data.