in reply to Re^5: Working with large amount of data
in thread Working with large amount of data

Interesting. Suppose the data is randomly distributed over the 100 "buckets": you will have to save the data to 100 separate files (because you cannot keep all the data in memory), then read each of these files, store its data in a hash, and save that hash to disk again. Would that (reading the big file once, saving to 100 smaller files, then reading each of these smaller files and saving to another 100 "hash" files) be faster than going through the big file once and storing the data in a database?
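
Just to make sure I understand the proposal, a rough sketch of that split-then-hash approach might look like the following. This is only an illustration: the file names (big.dat, bucket_N.dat), the one-record-per-line format and the whitespace-separated key in the first field are all assumptions, not anything from the actual data.

    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);

    my $buckets = 100;

    # Pass 1: stream through the big file once, appending each record to
    # one of 100 bucket files chosen by a hash of its key.
    my @out;
    for my $i (0 .. $buckets - 1) {
        open $out[$i], '>', "bucket_$i.dat" or die "bucket_$i.dat: $!";
    }
    open my $big, '<', 'big.dat' or die "big.dat: $!";
    while (my $line = <$big>) {
        my ($key) = split ' ', $line;
        next unless defined $key;
        my $b = hex(substr(md5_hex($key), 0, 8)) % $buckets;
        print { $out[$b] } $line;
    }
    close $_ for $big, @out;

    # Pass 2: each bucket now fits in memory, so build a hash per bucket,
    # process it, and save the result (e.g. with Storable) before moving on.
    for my $i (0 .. $buckets - 1) {
        my %data;
        open my $in, '<', "bucket_$i.dat" or die "bucket_$i.dat: $!";
        while (my $line = <$in>) {
            chomp $line;
            my ($key, @rest) = split ' ', $line;
            push @{ $data{$key} }, \@rest;
        }
        close $in;
        # ... work with %data here, then write it out ...
    }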

Would selecting and retrieving data not be slower if you have to deal with 100 smaller files?

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James


Re^7: Working with large amount of data
by tilly (Archbishop) on Sep 21, 2009 at 20:03 UTC
    Whenever you're considering a design, remember the following: seeks to disk cost about 0.01 seconds each, but it is reasonable to stream data back and forth at 50 MB/s. Therefore it is worth doing a lot of extra work to be able to stream data rather than seek to disk.
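
    To put rough numbers on that, here is a back-of-the-envelope calculation using those figures; the data size and record count are made up purely for illustration.

        use strict;
        use warnings;

        # Figures from above: ~0.01 s per seek, ~50 MB/s streaming.
        # The data size and record count are assumptions for illustration.
        my $data_mb = 10_000;           # 10 GB of data
        my $records = 50_000_000;       # 50 million records

        my $stream_s = $data_mb / 50;   # one sequential pass over everything
        my $seek_s   = $records * 0.01; # one random seek per record

        printf "streaming: %9.0f s (a few minutes)\n", $stream_s;  # 200 s
        printf "seeking:   %9.0f s (almost 6 days)\n", $seek_s;    # 500,000 s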

    If you've done things right, for large data sets your time is entirely dominated by the time to stream through data. So 1 file vs 100 files is irrelevant. But splitting directly into 100 files may be a horrible idea for the simple reason that disk drives are typically able to stream data at high rates to a fixed number of locations. Like 4 or 16. So you'd probably want to split the data in multiple passes if you went with this design. (That is not to say that this is the right design. Personally I head in the merge sort direction rather than using hashing.)
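
    For what it's worth, one sketch of such a multi-pass split in Perl might look like the code below: write to only 10 files per pass and split twice, ending up with 100 buckets while keeping only a handful of output streams open at any time. The file names, record format and key field are assumptions, not anything taken from the original poster's data.

        use strict;
        use warnings;
        use Digest::MD5 qw(md5_hex);

        # Final bucket number (0..99) for a record's key.
        sub bucket_of {
            my ($key) = @_;
            return hex(substr(md5_hex($key), 0, 8)) % 100;
        }

        # Split one file into 10 pieces, choosing the piece with $pick->(bucket).
        sub split_file {
            my ($infile, $prefix, $pick) = @_;
            my @out;
            for my $i (0 .. 9) {
                open $out[$i], '>', "$prefix$i.dat" or die "$prefix$i.dat: $!";
            }
            open my $in, '<', $infile or die "$infile: $!";
            while (my $line = <$in>) {
                my ($key) = split ' ', $line;
                next unless defined $key;
                print { $out[ $pick->( bucket_of($key) ) ] } $line;
            }
            close $_ for $in, @out;
        }

        # Pass 1: split the big file on the "tens" digit -> 10 intermediate files.
        split_file('big.dat', 'pass1_', sub { int( $_[0] / 10 ) });

        # Pass 2: split each intermediate file on the "ones" digit, giving
        # bucket_00.dat .. bucket_99.dat with only 10 handles open at a time.
        for my $i (0 .. 9) {
            split_file("pass1_$i.dat", "bucket_$i", sub { $_[0] % 10 });
        }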

    As for going to a database, my experience is that when your data sets are near the capacity of the machine, databases will often run into resource constraints and not figure a way out. It isn't that the query runs painfully slowly; it is that it grinds away for several hours and then the query crashes. That is one of the prime reasons that I have needed to do end runs around the database when working with large data sets.

      But splitting directly into 100 files may be a horrible idea for the simple reason that disk drives are typically able to stream data at high rates to a fixed number of locations. Like 4 or 16

      Are you completely sure about that?

      AFAIK the OS and the file system layer should mitigate any hardware limitation like that. Writes are cached and reordered before they are sent to the hardware, so there shouldn't be any difference between writing to one file or to a thousand...

      Well, unless you have your file system configured to commit everything immediately, but this is not common because of the huge performance penalty it imposes!

        I'm not completely sure about that. I had some bad experiences with Linux and disk drives a decade ago that have left me suspicious of how good the OS is at caching and reordering stuff. Things are certainly better now, but how much better I do not know.

        Put it this way. If I were solving this problem on this hardware, I'd be sure to do some trial runs on smaller data sets. And one thing I'd be testing is how many pieces to split a file into in one pass. Because it could matter.
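
        A trial run along those lines could be as simple as timing the same split at a few different fan-outs. The sample file name and the candidate piece counts below are only assumptions, and the sample should be large enough that the OS cannot simply absorb all the writes in its cache.

            use strict;
            use warnings;
            use Time::HiRes qw(gettimeofday tv_interval);

            # Time splitting the same sample file into different numbers of
            # pieces. 'sample.dat' should be a slice of the real data, large
            # enough that the writes cannot all be buffered in RAM.
            my $sample = 'sample.dat';

            for my $pieces (4, 16, 64, 256) {
                my @out;
                for my $i (0 .. $pieces - 1) {
                    open $out[$i], '>', "trial_${pieces}_$i.dat" or die $!;
                }
                open my $in, '<', $sample or die "$sample: $!";

                my $t0 = [gettimeofday];
                my $n  = 0;
                while (my $line = <$in>) {
                    # Round-robin is fine here; only the write pattern matters.
                    print { $out[ $n++ % $pieces ] } $line;
                }
                close $_ for $in, @out;

                printf "%4d pieces: %.2f s\n", $pieces, tv_interval($t0);
                unlink glob "trial_${pieces}_*.dat";   # clean up between runs
            }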