I was not talking about storing all of the email text in the database, just the email addresses found in the emails. That should be a few megabytes at most.
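For concreteness, here is a minimal sketch of the kind of table I have in mind. The table and column names match the script further down, but the SQLite DSN and the column sizes are placeholder assumptions of mine, not anything fixed:

#!/usr/bin/perl -w
use strict;

use DBI;

# Placeholder connection - substitute whatever database you actually use.
my $dbh = DBI->connect("dbi:SQLite:dbname=emails.db", "", "",
    { RaiseError => 1, AutoCommit => 1 });

# One row per (file, address) pair found by the extraction script below.
$dbh->do(qq(
    CREATE TABLE file_has_email (
        filename VARCHAR(255) NOT NULL,
        email    VARCHAR(255) NOT NULL
    )
));

# An index on email keeps the later "which files mention this address?"
# lookups cheap.
$dbh->do(qq(
    CREATE INDEX idx_file_has_email_email ON file_has_email (email)
));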

About the I/O, I agree with you that this is going to be an I/O-bound job. However, I/O-bound jobs tend to be limited by the latency of waiting on the disk, which makes them ideal candidates for parallelization. (The disk spins at the same speed no matter how many readers are waiting for their sector to come under the disk head, so while one process waits, another can be served.) The additional I/O of writing to the database is insignificant compared to the reading overhead. And the locking logic? Well, that is built into databases already.

How long does it take to write, and how long to run? Well, I hadn't done that yet. Lemme try writing most of it now, untested, off the top of my head.

#!/usr/bin/perl -w
use strict;

use Email::Find qw(find_emails);
use DBI;
use File::Slurp qw(slurp);

my $dbh = DBI->connect(insert appropriate details here);

# One prepared insert, reused for every address found.
my $sth = $dbh->prepare(qq(
    INSERT INTO file_has_email (filename, email) VALUES (?, ?)
)) or die "Cannot prepare: " . $dbh->errstr;

for my $file (@ARGV) {
    # Slurp the file and let Email::Find call us back for each address in it.
    find_emails(scalar slurp($file), sub {
        my $email = shift;
        $sth->execute($file, $email->format)
            or die "Cannot execute with ($file, " . $email->format . "): "
                 . $dbh->errstr;
        # Return the original matched text so the source is left unchanged.
        return shift;
    });
}

$dbh->commit() or die "Cannot commit: " . $dbh->errstr;
OK, I forgot to time myself, but that took 5-10 minutes. Based on past experience I'd expect that you'd be able to get full speed out of 5-10 copies of that running in parallel. Assuming performance similar to your version for each copy, that means a full run would take 2-4 minutes.
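For illustration, here is a rough sketch of how those parallel copies might be driven. This is my assumption about the setup, not part of the script above: it uses Parallel::ForkManager, picks 8 workers arbitrarily, and assumes the extraction script was saved as extract_emails.pl (a name I made up). Each child is a separate process with its own database connection, so no DBI handle is shared across a fork.

#!/usr/bin/perl -w
use strict;

use Parallel::ForkManager;

# 8 workers is an arbitrary starting point; tune it to your disk.
my $workers = 8;
my $pm      = Parallel::ForkManager->new($workers);

# Deal the files out round-robin, one list per worker.
my @chunks;
my $i = 0;
push @{ $chunks[ $i++ % $workers ] }, $_ for @ARGV;

for my $chunk (@chunks) {
    $pm->start and next;    # parent: move on to spawn the next child

    # Child: run the extraction script on its share of the files.
    system('perl', 'extract_emails.pl', @$chunk) == 0
        or die "Worker failed on chunk starting with $$chunk[0]: $?";

    $pm->finish;
}

$pm->wait_all_children;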

So if things are set up already, the database approach probably does get you the answer quicker. If you have to set up a database from scratch, it doesn't.

However, think about what happens as you develop the code that takes a closer look at those emails and tries to figure out why each one was miscategorized and what to do about it. For that it is easy to keep a small test list of, say, 50 emails and run your code against just that list. Those test runs will take on the order of seconds (how many depends on how complex your processing of the matches is), which gives you a quick development turnaround. Then, once you are fairly confident in the output, you can put the full list of 15,000 in and do a full run. My bet is that the biggest investment of human time is going to be in developing that follow-up code, and a quick testing turnaround on that is very helpful.
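As a sketch of how that follow-up code might pull its small test set straight out of the table populated above (the LIMIT of 50 and the SQLite DSN are just placeholders of mine):

#!/usr/bin/perl -w
use strict;

use DBI;

# Placeholder connection again - point it at the same database.
my $dbh = DBI->connect("dbi:SQLite:dbname=emails.db", "", "",
    { RaiseError => 1, AutoCommit => 1 });

# Pull a small, fixed sample of (file, address) pairs to develop against.
my $rows = $dbh->selectall_arrayref(qq(
    SELECT filename, email FROM file_has_email LIMIT 50
));

for my $row (@$rows) {
    my ($file, $email) = @$row;
    # ... the follow-up analysis of this match would go here ...
    print "$file: $email\n";
}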

Update: Forgot the all-important commit. (Of course as soon as I ran it on a test file, I would have noticed. And in real life I start all my $work scripts with a standard template that already has the connect and commit built in.)
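For reference, here is roughly what such a connect-and-commit template might look like. The DSN is a placeholder and the attribute choices are simply my habits, with AutoCommit turned off so the final commit actually matters:

#!/usr/bin/perl -w
use strict;

use DBI;

# Placeholder DSN, user, and password - substitute your own.
my $dbh = DBI->connect(
    "dbi:SQLite:dbname=emails.db", "", "",
    {
        RaiseError => 1,    # die on any DBI error instead of checking each call
        AutoCommit => 0,    # batch the inserts into one transaction
    },
) or die "Cannot connect: " . $DBI::errstr;

# ... the real work goes here ...

$dbh->commit;               # the all-important commit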

