in reply to Re: Algorithm advice sought for seaching through GB's of text (email) files
in thread Algorithm advice sought for seaching through GB's of text (email) files
Why I wouldn't use a DB for this application (unless there is a known alternate and ongoing use of the data).
Reading from 300,000 files is going to be an IO-bound process. Doubly so, if you are then writing to a DB.
By the time you have read the contents of the emails and isolated the appropriate address(es), the CPU/real time required for the lookup step in a hash is insignificant--hence IO-bound. Once the hash lookup is done, the process is complete. No need for any other "big step".
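For illustration, here is a minimal sketch of that hash approach (not the test app mentioned below): it walks an assumed ./mail directory, pulls addresses out of the From:/To:/Cc: header lines, and tallies them in a hash. The directory name, the header fields and the address pattern are all assumptions, and folded (continuation) header lines are ignored for brevity.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Find;

    my %seen;   # rough tally of how often each address appears

    find( sub {
        return unless -f $_;
        open my $fh, '<', $_ or return;
        while ( my $line = <$fh> ) {
            last if $line =~ /^\s*$/;         # blank line ends the header block
            next unless $line =~ /^(?:From|To|Cc):/i;
            $seen{ lc $_ }++ for $line =~ /([\w.+-]+\@[\w.-]+)/g;
        }
        close $fh;
    }, 'mail' );

    printf "%6d  %s\n", $seen{$_}, $_
        for sort { $seen{$b} <=> $seen{$a} } keys %seen;

Everything expensive there is the reading; the hash increment is a rounding error by comparison.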
The time taken for the lookup will also be insignificant compared to inserting those addresses into a DB. Besides the cost of communicating with the DB, the DB will need to write the data to the filesystem, further competing for kernel IO time and slowing the readers.
Additionally, there will be synchronisation costs associated with multiple writers to a single database. And that's all before you get to the point of needing to re-read all the data written to the filesystem in order to do that unnecessary "big step" (join).
Unless, that is, you decide to store only a subset of the data--in which case the chances are you will not have all the information available for more complex processing later.
When it comes to "just executing a query" against that database later, you may save a little time if you have indexed the appropriate fields from the original data. But if any of your query relies upon wild-card lookups in the unstructured part of the email--the body text--using LIKE '%term%' comparisons, then you will still need to process the same volumes of data from disc, and it will be much slower than using Perl's regex engine.
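By way of comparison, a minimal sketch of doing that "wild-card" search directly in Perl rather than via LIKE '%term%': slurp each message, drop the header block, and test the body with a compiled regex. The ./mail directory and the search term are hypothetical.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Find;

    my $term = qr/\Qquarterly report\E/i;   # hypothetical search term

    find( sub {
        return unless -f $_;
        open my $fh, '<', $_ or return;
        local $/;                           # slurp the whole message
        my $msg = <$fh>;
        close $fh;
        return unless defined $msg;
        my ( undef, $body ) = split /\r?\n\r?\n/, $msg, 2;   # discard the header block
        print "$File::Find::name\n" if defined $body && $body =~ $term;
    }, 'mail' );

Either way the whole body text comes off the disc; the difference is how much machinery sits between the read and the match.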
Writing my test app for this took 7 minutes, and a run took 20. How long did it take you to write your test app?