in reply to Re: Working with large amount of data
in thread Working with large amount of data

"because it's easy to manage 100 or so tables, and handling the hashing (checking uniqueness) will be easier/faster when there's a smaller number of entries in a given table"
I do not agree.

Running queries over data spread across 100 tables is far more difficult (and slower) than running the same query on a single table.

And isn't hashing supposed to work equally fast for large and small address spaces? A hash lookup should be O(1), if I remember correctly.

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re^3: Working with large amount of data
by tilly (Archbishop) on Sep 21, 2009 at 06:44 UTC
    Hashing is O(1), but the constant involves a seek to memory. When the hash is too big to fit in RAM, that turns into a seek to disk. On a typical hard drive you should plan on an average of 0.01 seconds for the drive head to be correctly positioned for your read.

    If you do a few billion lookups each taking 0.01 seconds, you're talking about a year. That's usually not acceptable.
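
    A quick back-of-envelope check of that figure (taking "a few billion" as 3 billion):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Disk-bound hash lookups at ~0.01 s per seek.
        my $lookups   = 3_000_000_000;   # "a few billion" lookups
        my $seek_time = 0.01;            # seconds per disk seek
        my $total     = $lookups * $seek_time;
        printf "%d seconds = about %.0f days\n", $total, $total / 86_400;
        # prints: 30000000 seconds = about 347 days

    So roughly a year of nothing but disk seeks.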

    Splitting the problem into a series of subproblems that each fit in RAM is a huge performance win, despite the added complexity.

      That is true, but you will have to (re)load each hash over and over again, whenever the one you need is among the 99 that happen not to be in memory at that moment.

      I'd rather trust the designers of the database to have handled this in an optimized way than roll my own caching system.

      I think 1 billion records totalling over 1 TB of data simply cries out for a professional database anyway.

      CountZero

      "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

        No, you just have to make one pass through the data to split it into 100 pieces according to which hash is needed, then make one pass through each piece.
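
        A minimal sketch of that two-pass approach in Perl (assuming one record per line; the input and bucket file names are just placeholders):

            #!/usr/bin/perl
            use strict;
            use warnings;
            use Digest::MD5 qw(md5);   # core module, used here only to pick a bucket

            my $buckets = 100;

            # Pass 1: split the input into 100 bucket files; every copy of
            # a given record lands in the same bucket file.
            my @out;
            for my $i (0 .. $buckets - 1) {
                open $out[$i], '>', "bucket_$i.txt" or die "bucket_$i.txt: $!";
            }
            open my $in, '<', 'records.txt' or die "records.txt: $!";  # placeholder input
            while (my $rec = <$in>) {
                chomp $rec;
                my $b = unpack('N', md5($rec)) % $buckets;   # stable bucket number
                print { $out[$b] } "$rec\n";
            }
            close $_ for $in, @out;

            # Pass 2: each bucket now fits in RAM, so uniqueness checking is
            # an ordinary in-memory hash, one bucket at a time.
            for my $i (0 .. $buckets - 1) {
                my %seen;
                open my $fh, '<', "bucket_$i.txt" or die "bucket_$i.txt: $!";
                while (my $rec = <$fh>) {
                    chomp $rec;
                    print "$rec\n" unless $seen{$rec}++;     # emit unique records
                }
                close $fh;
            }

        Each record is read twice and written once, all sequentially, so there are no random disk seeks, and the in-memory hash only ever holds about 1/100th of the keys.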

        As for professional databases: databases are not magic. If you understand what they are doing, it is possible to outperform them. Usually you don't want to go there, but when you need to, it can be done. I have personally done it several times.