in reply to fast disk db with bulk insert, fast read access, compact storage

I would like to better understand this: I can get myself an 80GB SSD

You can get a nice 1TB hard drive for ~$80. Why is there this restriction? The indexing that a DB does takes space, but it cuts down on the "seeking". Using more storage space to achieve higher overall performance is a common trade-off. Using 10x the storage may be faster overall even if accessing any one particular "hunk" of data is slower than on an SSD.


Re^2: fast disk db with bulk insert, fast read access, compact storage
by BrowserUk (Patriarch) on Sep 15, 2010 at 14:37 UTC

    If the problem described was fetching a single record by key, you might just be right.

    But read between the OP's lines a little and you can quite easily imagine that he is trying to implement something like Google's new auto-complete search thing.

    That means that each time the user types a keystroke, he has to:

    1. Re-query the index for a list of record numbers whose keys start with the accumulated keystrokes so far.
    2. Then read (say) 10 records matching that prefix and present them to the user.

    And repeat that for each new keystroke.

    Under that scenario, the SSD will be invaluable.
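
    As a rough illustration of the per-keystroke loop described above, here is a minimal in-memory sketch. It assumes the keys fit in a sorted Python list, which the real 32GB dataset would not; `prefix_matches`, `simulate_typing` and the sample `KEYS` are hypothetical names and data, not anything from the OP.

```python
from bisect import bisect_left

def prefix_matches(sorted_keys, prefix, limit=10):
    """Return up to `limit` keys from `sorted_keys` that start with `prefix`."""
    # Binary search for the first key >= prefix; all matches are contiguous from there.
    start = bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:start + limit]:
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches

# Hypothetical sample data standing in for the OP's keys.
KEYS = sorted(["perl", "perlmonks", "person", "python", "pearl", "ssd", "search"])

def simulate_typing(word):
    """Re-run the prefix query after every keystroke (steps 1 and 2, repeated)."""
    return [(word[:i], prefix_matches(KEYS, word[:i])) for i in range(1, len(word) + 1)]

if __name__ == "__main__":
    for typed, hits in simulate_typing("perl"):
        print(typed, hits)
```

    On disk, each of those lookups turns into a handful of random reads, which is where seek time starts to matter.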


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Well, actually I don't know exactly what the OP is trying to do, and I think that Grandfather's questions are on point.

      What the OP was describing is reminiscent of a telephony feature in the US called "dial by name". A US phone keypad has letters printed alongside the digits: 2 has ABC, 3 has DEF, etc. For example, BrowserUk is 276973785, assuming that I did that translation right! In this case 7 can mean R or S (the 7 key carries PQRS). Anyway, the way this works for the user is that he/she starts keying in letters and the system figures out, according to a set of heuristics, what to say to the user: e.g. are you close enough that I should give you, say, 3 options, or should I keep my mouth shut and let you keep dialing, or say something to get you to keep dialing, etc.
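
      For anyone who wants to check the keypad translation, here is a small sketch using the standard US letter layout. The helper names and the sample directory are made up for illustration, and the "heuristics" are reduced to a plain prefix match on the dialed digits.

```python
# Standard US telephone keypad letter groups.
KEYPAD = {
    "2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
    "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ",
}
LETTER_TO_DIGIT = {letter: digit for digit, letters in KEYPAD.items() for letter in letters}

def name_to_digits(name):
    """Translate a name into the digits a caller would press (non-letters are skipped)."""
    return "".join(LETTER_TO_DIGIT[c] for c in name.upper() if c in LETTER_TO_DIGIT)

def build_directory(names):
    """Precompute digit string -> names; this is the 'rebuild every night' part."""
    directory = {}
    for name in names:
        directory.setdefault(name_to_digits(name), []).append(name)
    return directory

def candidates(directory, dialed):
    """All names whose digit string starts with what has been dialed so far."""
    return sorted(
        name
        for digits, names in directory.items()
        if digits.startswith(dialed)
        for name in names
    )

if __name__ == "__main__":
    directory = build_directory(["BrowserUk", "GrandFather", "Brown"])  # hypothetical entries
    print(name_to_digits("BrowserUk"))
    print(candidates(directory, "2769"))
```

      A real system would decide from the size of the candidate list whether to speak up or to wait for more digits.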

      Anyway, knowing what all of the entries in the directory are, and being able to spend some CPU MIPS organizing them into an efficient data structure that the application can use, is very helpful to say the least. More memory helps. The fixed-vocabulary part will help a lot: minor additions can be done on the fly, but the "rebuild every night" part will help a whole lot.

      At this point, I don't know enough to say "hey, I recommend doing X". I'm just at the point of asking "why do you think that Z is a requirement/limitation?"

        Hm. Not sure where in the OP you get the 'dial-by-name' idea from.

        With 1GB of keys averaging 8 chars, that's about 128M key/value pairs. And 31GB / 128M pairs = ave. 248 char values, which is a bit big for a telephone number.
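
        The back-of-envelope sizing can be reproduced in a few lines. This sketch assumes binary units (1GB = 2**30 bytes) and that the 32GB splits into 1GB of keys and 31GB of values; with decimal units the average comes out a few characters lower, so treat the result as a ballpark in the same range as the estimate above.

```python
GIB = 2**30  # assumption: binary gigabytes

def average_value_length(key_bytes, avg_key_len, value_bytes):
    """Return (number of key/value pairs, average value length in bytes)."""
    pairs = key_bytes // avg_key_len
    return pairs, value_bytes / pairs

if __name__ == "__main__":
    pairs, avg_len = average_value_length(1 * GIB, 8, 31 * GIB)
    print(f"{pairs:,} pairs, average value of {avg_len:.0f} chars")
```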

        This bit of the OP seemed quite clear to me, hence my Google example:

        I want to do real-time search as my users are typing words.

        But I guess unless the OP comes back and clarifies, we won't know if I got it right or not.

        I'm currently playing with an indexer that indexes each record by each character and position in the keys. I project it would take 84 minutes to index the 32GB, and that it would then produce a count of matching records within 50 milliseconds. That's from disk with a cold cache; it should be substantially faster using an SSD.

        For the described dataset, it would use 8GB of primary index and 1GB of secondary, which puts it in the ballpark of the OP's requirements. Assuming that I read them correctly.
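
        To make "indexes each record by each character and position" concrete, here is one possible in-memory reading of that idea. It is only a sketch under assumptions: the actual indexer's on-disk layout and its primary/secondary split are not described here, and `build_index` and `match_prefix` are hypothetical names.

```python
from collections import defaultdict

def build_index(keys):
    """Map (position, character) -> set of record numbers having that character there."""
    index = defaultdict(set)
    for recno, key in enumerate(keys):
        for pos, ch in enumerate(key):
            index[(pos, ch)].add(recno)
    return index

def match_prefix(index, prefix):
    """Record numbers whose key starts with `prefix`; an empty prefix yields an empty set."""
    result = None
    for pos, ch in enumerate(prefix):
        postings = index.get((pos, ch), set())
        # Intersect the posting sets of each typed character at its position.
        result = set(postings) if result is None else result & postings
        if not result:
            return set()
    return result if result is not None else set()

if __name__ == "__main__":
    keys = ["perl", "pearl", "python", "ssd"]  # hypothetical sample keys
    index = build_index(keys)
    hits = match_prefix(index, "pe")
    print(len(hits), sorted(keys[r] for r in hits))
```

        Counting matches is then just the size of the final intersection, and each extra keystroke only narrows the previous result.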

