in reply to Re^3: Algorithm advice sought for searching through GB's of text (email) files
in thread Algorithm advice sought for searching through GB's of text (email) files

You're worried about using more than 2GB of hard drive space?

Of course not. How did you arrive at that question?

The point is that, to process the data, it has to be read from disc regardless of whether it is read directly or through a DB. But to process it through the DB, it must first be read (from the flat files), then written (to the DB files and indexes), then re-read (from/via the DB and indexes).

Sure, if the data is structured and can be indexed in a manner that aids the query, then the final re-read may entail reading less data than the original read--but it is still duplicated or triplicated effort unless there is a known future benefit from having it stored.
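For illustration, here is a minimal sketch of the direct, single-pass read being contrasted with the DB round trip. The directory and pattern are hypothetical, not taken from the thread; the point is simply that every byte is read from disc once, matched in the same pass, and nothing is written back out.

#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Hypothetical directory and pattern -- not taken from the thread.
my $maildir = '/var/mail/archive';
my $pattern = qr/delivery failure/i;

# One pass over the flat files: read, match, report. No intermediate store.
find( sub {
    return unless -f $_;
    open my $fh, '<', $_ or return;
    while ( my $line = <$fh> ) {
        if ( $line =~ $pattern ) {
            print "$File::Find::name:$.: $line";
            last;    # first hit per file is enough for this sketch
        }
    }
    close $fh;
}, $maildir );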

And, in an IO-bound process, all that extra IO does nothing to facilitate the performance improvements through parallelization that were one of tilly's cited benefits.
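To make the parallelization point concrete, here is a minimal sketch of splitting the same direct scan across a few worker processes. It assumes the Parallel::ForkManager module from CPAN and hypothetical file names; on a single spindle the workers still contend for the same disc, so the IO bound, not the process count, sets the ceiling.

#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;   # CPAN module; assumed to be installed

# Hypothetical file list and pattern -- not taken from the thread.
my @files   = glob '/var/mail/archive/*.mbox';
my $pattern = qr/delivery failure/i;

my $pm = Parallel::ForkManager->new(4);   # four scanning processes

for my $file (@files) {
    $pm->start and next;    # parent: hand this file to a child, move on
    if ( open my $fh, '<', $file ) {
        while ( my $line = <$fh> ) {
            print "$file:$.: $line" if $line =~ $pattern;
        }
        close $fh;
    }
    $pm->finish;             # child exits
}
$pm->wait_all_children;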


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^5: Algorithm advice sought for searching through GB's of text (email) files
by perrin (Chancellor) on Sep 24, 2006 at 23:16 UTC

    I drew that conclusion from this:

    If you take 2 GB of data from unstructured files and store it into a structured form in a database, it will require double, treble, or even quadruple the storage capacity of the original flatfiles depending upon how much of the original data you decide to structure. More if you fully index it.

    Maybe you meant something else, but that's what it sounded like.