in reply to Algorithm advice sought for seaching through GB's of text (email) files

The number of items in your data sets isn't the problem. You have 15,000 and 500,000, which are small enough to process easily with a database. The real issue is that one data set is highly unstructured.

There are a number of ways to skin this cat. But the key is that you're going to need to write a routine that takes an email message and produces a list of possible original recipients from that message. This step should be generous - you don't want to miss any.

Now pick your method. One approach is to create a hash whose keys are the 15,000 test email addresses and follow this pseudocode:

# %in_test has the 15,000 test email addresses as its keys.
while (my $file = get_email_file()) {
    for my $email (get_possible_original_recipients($file)) {
        if ($in_test{$email}) {
            # Do something here.
        }
    }
}
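As a minimal sketch of the extraction routine itself, assuming Email::Find and File::Slurp are available (the same modules used in the sample script further down this thread); the sub name simply matches the pseudocode above, and "generous" here means "every address-like string anywhere in the message":

use strict;
use warnings;
use Email::Find qw(find_emails);
use File::Slurp qw(slurp);

sub get_possible_original_recipients {
    my $file = shift;
    my $text = slurp($file);

    # Collect every address found in headers or body, de-duplicated.
    my %seen;
    find_emails($text, sub {
        my ($address, $orig_text) = @_;
        $seen{ lc $address->address } = 1;
        return $orig_text;    # leave the matched text unchanged
    });

    return keys %seen;
}

You would tune the rules for "possible original recipient" to your own data; the point is simply to over-collect here and filter later.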
A second approach is to create a table in a database with the fields (filename, email_in_file), populate another table with your test emails, then join them.
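As a rough sketch of what that looks like (the table names are just the ones I'd pick, and DBD::SQLite is only a convenient stand-in for whatever database you already have):

use strict;
use warnings;
use DBI;

# Assumes DBD::SQLite; substitute your own DSN and credentials.
my $dbh = DBI->connect("dbi:SQLite:dbname=emails.db", "", "",
                       { RaiseError => 1 });

$dbh->do(q{CREATE TABLE IF NOT EXISTS file_has_email (filename TEXT, email TEXT)});
$dbh->do(q{CREATE TABLE IF NOT EXISTS test_emails    (email TEXT PRIMARY KEY)});

# Once both tables are populated, one join answers "which files mention
# which of my test addresses?"
my $matches = $dbh->selectall_arrayref(q{
    SELECT f.filename, f.email
    FROM   file_has_email f
    JOIN   test_emails    t ON t.email = f.email
});
printf "%s contains %s\n", @$_ for @$matches;

Populating file_has_email is the expensive pass over the mail files; the join itself is cheap to re-run as the later processing evolves.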

Personally I'd suggest following the second approach for several reasons.

  1. It is easily parallelizable - you can have multiple jobs populating that first table at once so you get through that big step faster.
  2. As you tweak the more complex later processing, it is faster to re-run. You can just execute a query rather than having to search gigabytes of emails every time you tweak it.
  3. This is generally useful information to have; you're likely to find other uses for it later.

Re^2: Algorithm advice sought for seaching through GB's of text (email) files
by BrowserUk (Patriarch) on Sep 24, 2006 at 14:37 UTC

    Why I wouldn't use a DB for this application (unless there is a known alternate and ongoing use of the data).

    1. There is no point in parallelizing the population of the DB.

      Reading from 300,000 files is going to be an IO-bound process. Doubly so, if you are then writing to a DB.

      By the time you have read the contents of the emails and isolated the appropriate address(es), the CPU and real time required for the lookup step in a hash are insignificant--hence IO-bound. Once the hash lookup is done the process is complete. No need for any other "big step".

      The time taken for the lookup will also be insignificant compared to inserting those addresses into a DB. Besides the cost of communicating with the DB, the DB will need to write the data to the filesystem, further competing for kernel IO time and slowing the readers.

      Additionally, there will be synchronisation costs associated with multiple writers to a single database. And that's all before you get to the point of needing to re-read all the data written to the filesystem in order to do that unnecessary "big step" (join).

    2. If you take 2 GB of data from unstructured files and store it in structured form in a database, it will require double, treble, or even quadruple the storage capacity of the original flat files, depending upon how much of the original data you decide to structure. More if you fully index it.

      Unless you decide to store only a subset of the data, in which case the chances are you will not have all the information available for the more complex processing.

      When it comes to "just executing a query" against that database later, you may save a little time if you have indexed the appropriate fields from the original data. But if any of your query relies upon wild-card lookups in the unstructured part of the email--the body text--using LIKE '%term%' comparisons, then you will still need to process the same volume of data from disc, and it will be much slower than using Perl's regex engine (see the short illustration after this list).

    3. But will it be quicker to write the program to find it?

      Writing my test app for this took 7 minutes, and a run took 20. How long did it take you to write your test app?
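    To illustrate the point in item 2 (the table and column names below are hypothetical): a leading-wildcard LIKE cannot use an ordinary index, so the database still scans every stored body, whereas scanning the flat files directly reads the same volume of data but keeps Perl's regex engine in play.

    use strict;
    use warnings;
    use File::Slurp qw(slurp);

    # The database route would be something like
    #   SELECT filename FROM emails WHERE body LIKE '%some term%'
    # -- a full scan regardless of indexes.  The flat-file route:
    my @hits = grep { slurp($_) =~ /some term/ } @ARGV;
    print "$_\n" for @hits;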


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I was not talking about storing all of the email in the database. Just the email addresses found in the emails. This should be a few megabytes at most.

      About the I/O, I agree with you that this is going to be an I/O-bound job. However, I/O-bound jobs tend to be limited by the latency of waiting for the disk, which makes them ideal candidates for parallelization. (The disk spins at the same speed no matter how many readers are waiting for their sector to come under the disk head.) The additional I/O of writing to the database is insignificant compared to the reading overhead. And the locking logic - well, that is built into databases already.

      How long does it take to write, how long to run? Well I didn't do that. Lemme try writing most of it now, untested, off of the top of my head.

      #!/usr/bin/perl -w
      use strict;

      use Email::Find qw(find_emails);
      use DBI;
      use File::Slurp qw(slurp);

      my $dbh = DBI->connect(insert appropriate details here);
      my $sth = $dbh->prepare(qq(
          INSERT INTO file_has_email (filename, email) VALUES (?, ?)
      )) or die "Cannot prepare: " . $dbh->errstr;

      for my $file (@ARGV) {
          find_emails(scalar slurp($file), sub {
              my $email = shift;
              $sth->execute($file, $email->format)
                  or die "Cannot execute with ($file, " . $email->format . "): "
                       . $dbh->errstr;
              return shift;    # hand back the matched text unchanged
          });
      }

      $dbh->commit() or die "Cannot commit: " . $dbh->errstr;
      OK. I forgot to time myself but that took 5-10 minutes. Based on past experience I'd expect that you'd be able to get full speed out of 5-10 copies of that in parallel. Assuming similar performance to your version, that means it would take 2-4 minutes per run.
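      One hedged sketch of driving those parallel copies, using Parallel::ForkManager from CPAN; "load_emails.pl" is just a made-up name for the script above:

      use strict;
      use warnings;
      use Parallel::ForkManager;

      my $workers = 8;                         # 5-10 copies, per the estimate above
      my @files   = @ARGV;                     # full list of email files
      my $pm      = Parallel::ForkManager->new($workers);

      # Deal the files out round-robin into one chunk per worker.
      my @chunks;
      push @{ $chunks[ $_ % $workers ] }, $files[$_] for 0 .. $#files;

      for my $chunk (grep { defined } @chunks) {
          $pm->start and next;                 # parent launches the next worker
          system('perl', 'load_emails.pl', @$chunk) == 0
              or warn "worker failed: $?";
          $pm->finish;
      }
      $pm->wait_all_children;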

      So if things are set up already, the database approach probably does get you the answer quicker. If you have to set up a database from scratch, it doesn't.

      However, think about what happens as you are developing the code that takes a closer look at those emails and tries to figure out why a given email was miscategorized and what to do about it. For that it is easy to have a small test list of, say, 50 emails and execute your code against that list. Those test runs will take on the order of seconds (how many depends on how complex your processing of the matches is), which gives you a quick development turnaround time. Then when you are fairly confident in the output, you can put the full list of 15,000 in and do a full run. My bet is that the biggest investment of human time is going to be in developing that follow-up code, and a quick testing turnaround on that is very helpful.

      Update: Forgot the all-important commit. (Of course as soon as I ran it on a test file, I would have noticed. And in real life I start all my $work scripts with a standard template that already has the connect and commit built in.)

        (The disk spins at the same speed no matter how many readers are waiting for their sector to come under the disk head.)

        That's way too simplistic.

        The rotation of the disk is not the limiting factor; head movement is. With one process moving serially through 300,000 files, successive reads by a single process can readily expect to continue reading from the same track. With multiple processes reading from different files, each context switch (itself overhead) also incurs the additional, high-cost overhead of a head move.

        Additionally, with small files (emails), when the process calls for a 512-byte or 4k read, the device driver will likely read the entire file (if it happens to be contiguous on a track) and store it in cache. However, the more processes reading new data from disk, the greater the likelihood that the cached (but unread) data for one process/file will need to be discarded to accommodate reads by another process after a context switch, with the result that some data will need to be read multiple times.


      You're worried about using more than 2GB of hard drive space? Isn't the going rate about $1 per GB? Space doesn't seem like an issue here, with such a small set of data.
        You're worried about using more than 2GB of hard drive space?

        Of course not. How did you arrive at that question?

        The point is, to process the data, it has to be read from disc regardless of whether it is read directly or through a DB. But to process it through the DB, it has first to be read (from the flat files), then written (to the DB files and indexes), then re-read (from/via the DB and indexes).

        Sure, if the data is structured and can be indexed in a manner that aids the query, then the final re-read may entail reading less data than the original read--but it is still duplicated or triplicated effort unless there is a known future benefit from having it stored.

        And, in an IO-bound process, all that extra IO does nothing to facilitate performance improvements through the use of parallelization. One of tilly's cited benefits.

