I was not talking about storing all of the email in the database, just the email addresses found in the emails. That should be a few megabytes at most.
About the I/O, I agree with you that this is going to be an I/O bound job. However, I/O bound jobs tend to be limited by latency while waiting on disk, which makes them ideal candidates for parallelization. (The disk spins at the same speed no matter how many readers are waiting for their sector to come under the disk head.) The additional I/O of writing to the database is insignificant compared to the reading overhead. And the locking logic - well, that is already built into the database.
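To make that concrete, here is a minimal sketch of running several workers over the file list at once. It assumes the CPAN module Parallel::ForkManager is installed, and process_files() is a hypothetical routine standing in for the actual connect-scan-insert-commit work; the worker count is just a guess to tune against your disk.

    #!/usr/bin/perl -w
    use strict;
    use Parallel::ForkManager;

    my $workers = 8;                      # tune to taste; the disk is the bottleneck
    my $pm      = Parallel::ForkManager->new($workers);

    # Deal the files out round-robin, one bucket per worker.
    my @buckets;
    my $i = 0;
    push @{ $buckets[ $i++ % $workers ] }, $_ for @ARGV;

    for my $bucket (@buckets) {
        $pm->start and next;              # parent keeps looping, child falls through
        # Each child must open its own $dbh - DBI handles don't survive a fork.
        process_files(@$bucket);          # hypothetical: connect, scan, insert, commit
        $pm->finish;
    }
    $pm->wait_all_children;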
How long does it take to write, and how long to run? Well, I didn't actually time that. Lemme try writing most of it now, untested, off the top of my head.
#!/usr/bin/perl -w
use strict;
use Email::Find qw(find_emails);
use DBI;
use File::Slurp qw(slurp);

my $dbh = DBI->connect(insert appropriate details here);
my $sth = $dbh->prepare(qq(
    INSERT INTO file_has_email (filename, email)
    VALUES (?, ?)
)) or die "Cannot prepare: " . $dbh->errstr;

for my $file (@ARGV) {
    find_emails(scalar slurp($file), sub {
        my $email = shift;
        $sth->execute($file, $email->format)
            or die "Cannot execute with ($file, "
                 . $email->format . "): " . $dbh->errstr;
        return shift;    # hand the matched text back unchanged
    });
}

$dbh->commit() or die "Cannot commit: " . $dbh->errstr;
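For reference, the table that script writes to would look something like the sketch below. The column types are my assumption - the only thing the INSERT actually requires is the two named columns.

    # One-off setup sketch; schema details are a guess.
    $dbh->do(qq(
        CREATE TABLE file_has_email (
            filename VARCHAR(255),
            email    VARCHAR(255)
        )
    )) or die "Cannot create table: " . $dbh->errstr;

With the table in place, you just run the script with the files to scan as its arguments.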
OK. I forgot to time myself, but writing that script took 5-10 minutes. Based on past experience, I'd expect you'd be able to get full speed out of 5-10 copies of it running in parallel. Assuming similar performance to your version, that means it would take 2-4 minutes per run.
So if things are set up already, the database approach probably does get you the answer quicker. If you have to set up a database from scratch, it doesn't.
However, think about what happens as you are developing your code to take a closer look at those emails and try to figure out why a given email was miscategorized and what to do about it. For that it is easy to have a small test list of, say, 50 emails and execute your code against that list. Those test runs will take on the order of seconds (how many depends on how complex your processing of the matches is), which gives you a quick development turnaround time. Then, when you are fairly confident in the output, you can put the full list of 15,000 in and do a full run. My bet is that the biggest investment of human time is going to be in developing that follow-up code, and a quick testing turnaround on that is very helpful.
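A rough sketch of what such a test loop might look like, once the addresses are in the database - the LIMIT-based sampling and the analyze_email() routine are placeholders for whatever follow-up processing you end up writing:

    # Pull a small sample out of the database and run the follow-up
    # analysis on just those rows, for a fast edit-run-inspect cycle.
    my $rows = $dbh->selectall_arrayref(qq(
        SELECT filename, email
        FROM file_has_email
        LIMIT 50
    ));
    for my $row (@$rows) {
        my ($filename, $email) = @$row;
        analyze_email($filename, $email);   # hypothetical follow-up code under development
    }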
Update: I forgot the all-important commit. (Of course, as soon as I ran it on a test file I would have noticed. And in real life I start all my $work scripts from a standard template that already has the connect and commit built in.)