in reply to Re^2: Algorithm advice sought for seaching through GB's of text (email) files
in thread Algorithm advice sought for seaching through GB's of text (email) files

I was not talking about storing all of the email in the database. Just the email addresses found in the emails. This should be a few megabytes at most.

About the I/O, I agree with you that this is going to be an I/O bound job. However I/O bound jobs tend to get limited by latency on waiting for disk. That means that they are ideal candidates for parallelization. (The disk spins at the same speed no matter how many readers are waiting for their sector to come under the disk head.) The additional I/O of writing to the database is insignificant compared to the reading overhead. And the locking logic - well that is built into databases already.
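A minimal sketch of that parallelization, using plain fork (Parallel::ForkManager would be tidier). The chunking helper and the `process_file` stub are my own illustrative names, not anything from the script below; each child would open its own database handle and run the extraction loop on its share of the files:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $WORKERS = 5;

# Deal the file list round-robin into $n chunks, one per worker.
sub chunk_list {
    my ($n, @items) = @_;
    my @chunks = map { [] } 1 .. $n;
    push @{ $chunks[ $_ % $n ] }, $items[$_] for 0 .. $#items;
    return @chunks;
}

my @chunks = chunk_list($WORKERS, @ARGV);

my @pids;
for my $chunk (@chunks) {
    my $pid = fork();
    die "Cannot fork: $!" unless defined $pid;
    if ($pid == 0) {
        # Child: each worker should connect to the database itself
        # rather than inherit the parent's handle.
        process_file($_) for @$chunk;
        exit 0;
    }
    push @pids, $pid;
}
waitpid($_, 0) for @pids;

sub process_file {
    my $file = shift;
    # ... find_emails + INSERT, as in the main script ...
}
```

The round-robin split keeps the chunks within one file of each other in size, so no worker finishes far ahead of the rest.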

How long does it take to write, and how long to run? Well, I didn't time it. Lemme try writing most of it now, untested, off the top of my head.

#!/usr/bin/perl -w
use strict;

use Email::Find qw(find_emails);
use DBI;
use File::Slurp qw(slurp);

my $dbh = DBI->connect(insert appropriate details here);
my $sth = $dbh->prepare(qq(
    INSERT INTO file_has_email (filename, email)
    VALUES (?, ?)
)) or die "Cannot prepare: " . $dbh->errstr;

for my $file (@ARGV) {
    find_emails(scalar slurp($file), sub {
        my $email = shift;
        $sth->execute($file, $email->format)
            or die "Cannot execute with ($file, " . $email->format . "): "
                . $dbh->errstr;
        return shift;    # return the original text unchanged
    });
}

$dbh->commit() or die "Cannot commit: " . $dbh->errstr;
OK. I forgot to time myself but that took 5-10 minutes. Based on past experience I'd expect that you'd be able to get full speed out of 5-10 copies of that in parallel. Assuming similar performance to your version, that means it would take 2-4 minutes per run.

So if things are set up already, the database approach probably does get you the answer quicker. If you have to set up a database from scratch, it doesn't.

However, think about what happens as you are developing your code to take a closer look at those emails and try to figure out why they were miscategorized and what to do about it. For that it is easy to have a small test list of, say, 50 emails and execute your code against that list. Those test runs will take on the order of seconds (how many depends on how complex your processing of the matches is), which results in quick development turnaround time. Then, when you are fairly confident in the output, you can put the full list of 15,000 in and do a full run. My bet is that the biggest investment of human time is going to be in developing that follow-up code, and a quick testing turnaround on that is very helpful.
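One hypothetical way to carve out that small test list, assuming the filenames sit one per line in a list file (the names `filelist.txt` and `extract_emails.pl` here are made up for illustration):

```perl
use strict;
use warnings;

# Read at most $n filenames from a one-per-line list file,
# for a quick test run against a small sample.
sub first_n_lines {
    my ($path, $n) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my @out;
    while (my $line = <$fh>) {
        chomp $line;
        push @out, $line;
        last if @out == $n;
    }
    return @out;
}

# e.g. run the extraction script on just the sample:
# my @sample = first_n_lines('filelist.txt', 50);
# system($^X, 'extract_emails.pl', @sample);
```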

Update: Forgot the all-important commit. (Of course as soon as I ran it on a test file, I would have noticed. And in real life I start all my $work scripts with a standard template that already has the connect and commit built in.)


Re^4: Algorithm advice sought for seaching through GB's of text (email) files
by BrowserUk (Patriarch) on Sep 24, 2006 at 19:17 UTC
    (The disk spins at the same speed no matter how many readers are waiting for their sector to come under the disk head.)

    That's way too simplistic.

    The rotation of the disk is not the limiting factor; head movement is. With one process moving serially through 300,000 files, successive reads by a single process can readily expect to continue reading from the same track. With multiple processes reading from different files, each context switch (itself overhead) will also incur the additional, high-cost overhead of a head move.

    Additionally, with small files (emails), when the process calls for a 512-byte or 4k read, the device driver will likely read the entire file (if it happens to be contiguous on a track) and store it in cache. However, the more processes reading new data from disk, the greater the likelihood that the cached (but unread) data for one process/file will need to be discarded to accommodate reads by another process after a context switch, with the result that some data will need to be read multiple times.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      At some point you are right: contention becomes an issue. However, I have found as a practical matter on commodity hardware that that point isn't hit until you have 5-10 parallel processes reading.

      Of course, I last looked into this a while ago. The figure is undoubtedly different today. Benchmark for your purposes.

      But my point remains. I/O bound jobs are excellent candidates for parallelization. Due to various sources of latency, one process generally comes nowhere close to maxing out I/O. If you don't have to worry about getting in other people's way, it almost always makes sense to have multiple parallel processes.

        Due to various sources of latency, one process generally comes nowhere close to maxing out I/O

        Hmm. See threads and multiple filehandles for the most recent example where this has been shown to not be the case.
