Why I wouldn't use a DB for this application (unless there is a known alternate and ongoing use of the data).

  1. There is no point in parallelising the population of the DB.

    Reading from 300,000 files is going to be an IO-bound process. Doubly so, if you are then writing to a DB.

    By the time you have read the contents of the emails and isolated the appropriate address(es), the CPU and elapsed time required for the lookup step in a hash is insignificant--hence IO-bound. Once the hash lookup is done, the process is complete. There is no need for any other "big step".

    The time taken for the lookup will also be insignificant compared to inserting those addresses into a DB. Besides the costs of communications with the DB, the DB will need to write the data to the filesystem, further competing for kernel IO time and slowing the readers.

    Additionally, there will be synchronisation costs associated with multiple writers to a single database. And that's all before you get to the point of needing to re-read all the data written to the filesystem in order to do that unnecessary "big step" (join).
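
    To make the "read, isolate the addresses, look them up in a hash" flow concrete, here is a minimal sketch. It is in Python rather than Perl purely for illustration, and it assumes the task is to find messages that mention any of a known set of addresses in their From/To/Cc headers. The function names (read_messages, addresses_in, matching_messages) are hypothetical and not taken from the original test app.

```python
import os
from email import message_from_string
from email.utils import getaddresses


def read_messages(root):
    """Yield (path, raw_text) for every file under root. This is the IO-bound part."""
    for dirpath, _dirs, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            with open(path, errors="replace") as fh:
                yield path, fh.read()


def addresses_in(raw_message):
    """Return the lower-cased addresses found in the From/To/Cc headers."""
    msg = message_from_string(raw_message)
    headers = []
    for name in ("From", "To", "Cc"):  # assumption: only these headers matter
        headers.extend(msg.get_all(name, []))
    return {addr.lower() for _realname, addr in getaddresses(headers) if addr}


def matching_messages(messages, wanted):
    """Yield (name, hits) for each message that mentions a wanted address."""
    wanted = {w.lower() for w in wanted}
    for name, raw in messages:
        # The in-memory lookup: a set intersection, trivial next to the file read.
        hits = addresses_in(raw) & wanted
        if hits:
            yield name, hits
```

    Under those assumptions, something like matching_messages(read_messages("/path/to/mail"), known_addresses) does the whole job in a single pass over the files, with no intermediate store to write and re-read.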

  2. If you take 2 GB of data from unstructured files and store it in a structured form in a database, it will require double, treble, or even quadruple the storage capacity of the original flat files, depending upon how much of the original data you decide to structure. More if you fully index it.

    That is, unless you decide to store only a subset of the data, in which case the chances are you will not have all the information available for more complex processing.

    When it comes to "just executing a query" against that database later, you may save a little time if you have indexed the appropriate fields from the original data. But if any of your queries rely upon wild-card lookups in the unstructured part of the email--the body text--using LIKE '%term%' comparisons, then you will still need to process the same volumes of data from disc, but it will be much slower than using Perl's regex engine.
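
    For comparison, the flat-file equivalent of a LIKE '%term%' query is a single compiled pattern run over each message as it is read. The sketch below is again Python for illustration only, not the Perl the point refers to. The name files_matching is hypothetical, and case-insensitive matching is an assumption (whether LIKE ignores case depends on the database and its collation).

```python
import re


def files_matching(messages, term):
    """Yield the names of messages whose text contains term as a literal substring."""
    # re.escape makes the term literal, so "a.b" matches only "a.b", not "axb".
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    for name, text in messages:
        if pattern.search(text):
            yield name
```

    Here messages is any iterable of (name, text) pairs, for instance a generator that reads each file from disc in turn, so only one message needs to be held in memory at a time.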

  3. But will it be quicker to write the program to find it?

    Writing my test app for this took 7 minutes, and a run took 20. How long did it take you to write your test app?


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re^2: Algorithm advice sought for seaching through GB's of text (email) files by BrowserUk
in thread Algorithm advice sought for seaching through GB's of text (email) files by chargrill
