
Hashing is O(1), but the constant factor includes a seek to memory. When the hash is too big to fit in RAM, that turns into a seek to disk. On a typical hard drive you should plan on an average of about 0.01 seconds for the drive head to be positioned correctly for your read.

If you do a few billion lookups each taking 0.01 seconds, you're talking about a year. That's usually not acceptable.
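(To check the arithmetic, assuming three billion lookups: 3 × 10⁹ × 0.01 s = 3 × 10⁷ s, which is roughly 347 days.)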

Splitting the problem into a series of subproblems that each fit in RAM is a huge performance win, despite the added complexity.
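
As a rough illustration of the splitting pass (this is my sketch, not code from the thread; the 100-bucket count, the bucket_NN.txt file names, and the tab-separated key/value input format are all assumptions), you can route each record to one of 100 bucket files by a stable hash of its key, so that each bucket can later be processed entirely in RAM:

    use strict;
    use warnings;
    use Digest::MD5 qw(md5);

    # Sketch of the splitting pass: route each record to one of 100
    # bucket files chosen by a stable hash of its key, so that every
    # bucket can later be processed entirely in memory.
    my $buckets = 100;

    # Open one output handle per bucket (100 handles is well within
    # normal per-process file descriptor limits).
    my @out;
    for my $i (0 .. $buckets - 1) {
        open $out[$i], '>', sprintf("bucket_%02d.txt", $i)
            or die "Cannot open bucket $i: $!";
    }

    # One pass over the big input: the first tab-separated field is
    # treated as the key and decides the bucket.
    while (my $line = <STDIN>) {
        my ($key) = split /\t/, $line, 2;
        my $bucket = unpack("N", md5($key)) % $buckets;
        print { $out[$bucket] } $line;
    }

    close $_ for @out;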


Re^4: Working with large amount of data
by CountZero (Bishop) on Sep 21, 2009 at 08:22 UTC
    That is true, but you will have to (re)load a hash over and over again whenever the one you need is among the 99 that happen not to be in memory at that moment.

    I'd rather trust the designers of the database to have handled this in a reasonably optimized way than roll my own caching system.

    I think 1 billion records with over 1 TB of data simply cries out for a professional database.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      No, you just have to make one pass through the data to split it into 100 pieces according to which hash is needed, then make one pass through each piece.
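
      As a rough sketch of that second pass (again my sketch, not code from the thread; it carries over the assumed bucket_NN.txt files and tab-separated records from the splitting sketch above, and process_record() is a hypothetical stand-in for the real per-key work), each piece is small enough to load into an in-memory hash and work on by itself:

          use strict;
          use warnings;

          # Sketch of the second pass: each bucket file from the splitting
          # pass is loaded into its own hash, which fits in RAM, and then
          # processed with purely in-memory lookups.
          sub process_record {
              my ($key, $value) = @_;
              print "$key => $value\n";   # stand-in for the real work
          }

          for my $i (0 .. 99) {
              my $file = sprintf("bucket_%02d.txt", $i);
              open my $in, '<', $file or die "Cannot open $file: $!";

              my %data;                   # holds one bucket only
              while (my $line = <$in>) {
                  chomp $line;
                  my ($key, $value) = split /\t/, $line, 2;
                  $data{$key} = $value;
              }
              close $in;

              process_record($_, $data{$_}) for keys %data;
          }

      Only one bucket's hash is alive at a time, so memory use stays bounded by the largest bucket.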

      As for professional databases, databases are not magic. If you understand what they are doing then it is possible to outperform them. Usually you don't want to go there, but when you need to it can be done. I have personally done it several times.

        Interesting. Suppose the data is randomly distributed over the 100 "buckets": you would have to save the data in 100 separate files (because you cannot keep it all in memory), then read each of those files and store its data in a hash, which you would have to save to disk again. Would that (reading the big file once, saving to 100 smaller files, then reading each of those smaller files and saving to another 100 "hash" files) be faster than going through the big file once and storing the data in a database?

        Would selecting and retrieving data not be slower if you have to deal with 100 smaller files?

        CountZero
