in reply to Re: fast disk db with bulk insert, fast read access, compact storage
in thread fast disk db with bulk insert, fast read access, compact storage

thanks everyone for the pointed questions. I had tried to keep my question reasonably general, because I thought it would be better for others who have similar questions.

the specific application is a database of published articles. think of

  unique-key|Time Magazine|Why the monks are great|Sep 13, 2006|p245-133|volume 8|number 10

the database, in plain text and in this form, is about 5GB now (but could grow to 20GB in the future), so the ASCII version fits into RAM. the DB usually changes about once per month, and I could rebuild it from scratch from the source data each time. there is no guarantee on the length or uniqueness of anything except the unique key.

I do need quick access to individual words. so, if I want to find all articles that contain the word 'Time' and the word 'monks' and the number 245, my search should be blindingly fast at finding all unique keys whose records contain the three words, and then displaying these records. assume access is very frequent, too: say I wanted to do 'permutation of words' research, so that each article launches a search over the database.

the lazy implementation would be to take every word and put it as a key into a hash, with the value being an array of the unique keys of the records where the word occurs; plus a second hash which gives me the record for a given unique key. of course, with perl hashes, this would take too much space. from my limited experience with SQL, the rearranged data would blow up a lot there, too.
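
to make the lazy version concrete, here is a minimal sketch of the two hashes on three made-up records (the keys k1..k3 and the second and third records are invented for illustration). it assumes a "word" is a run of letters or a run of digits, compared case-insensitively, so that 'p245-133' yields 'p', '245' and '133'; the names %record_for, %keys_for_word and search_all are placeholders of my own, not anything from a module.

```perl
use strict;
use warnings;

# Hypothetical sample records in the pipe-delimited form shown above.
my @lines = (
    'k1|Time Magazine|Why the monks are great|Sep 13, 2006|p245-133|volume 8|number 10',
    'k2|Time Magazine|Another article|Oct 1, 2006|p12-14|volume 8|number 11',
    'k3|Monks Weekly|Time for tea|Jan 5, 2007|p245|volume 1|number 2',
);

my %record_for;     # unique key => full record
my %keys_for_word;  # lowercased word => array ref of unique keys

for my $line (@lines) {
    my ($key, $rest) = split /\|/, $line, 2;
    $record_for{$key} = $line;

    # Assumption: a word is a run of letters or a run of digits.
    my %seen;
    for my $word (map { lc($_) } $rest =~ /[A-Za-z]+|[0-9]+/g) {
        # List each key only once per word, even if the word repeats.
        push @{ $keys_for_word{$word} }, $key unless $seen{$word}++;
    }
}

# Return the unique keys of the records that contain every given word.
sub search_all {
    my %wanted = map { lc($_) => 1 } @_;
    my @words  = keys %wanted;
    return () unless @words;

    # Count, per key, how many of the query words it appears under.
    my %count;
    for my $word (@words) {
        $count{$_}++ for @{ $keys_for_word{$word} || [] };
    }
    my @hits = grep { $count{$_} == @words } keys %count;
    return sort @hits;
}

print "$record_for{$_}\n" for search_all('Time', 'monks', '245');
```

this only shows the shape of the lookup. it keeps everything in ordinary perl hashes, so it has exactly the space problem described above and says nothing about how a disk-based store would behave.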

on the plus side, this is all "read-only".

sql would be ok, but it just feels like it is not the right tool for the job. sql dbs seem made more for updating than for blindingly fast read access.

I was also only guessing that SSD would be a good tool for the job.

help?


Re^3: fast disk db with bulk insert, fast read access, compact storage
by BrowserUk (Patriarch) on Sep 18, 2010 at 08:24 UTC

    Your original description of your application:

    simple---key, value. ... 32GB ... (think as application of a word hash that I am rebuilding every night, and I want to do real-time search as my users are typing words.)

    Is almost completely at odds with this description:

    is about 5GB now (but could grow to 20GB in the future) ... the DB changes, say, once per month ... I do need quick access into individual words. so, if I want to find all articles that contain the word 'Time' and the word 'monks' and the number 245, my search should be blindingly fast to find all unique-keys that contain the three words, and then display these records.

    The former implies indexing by the characters of the unique key only.

    The latter requires a fully inverted index of the words in the entire records, which essentially makes the unique key redundant.

    You need to define the actual use your data will be put to, before looking for the mechanism for doing it.

      actually, I think the db descriptions are pretty much the same (5GB for testing now, 20GB in the future, so 32GB is a good upper limit), although neither description was very good. however, what is very good is that you told me what it is that I am really looking for: an inverted index. thanks a lot. very helpful. I should be able to search for this now in a much more intelligent fashion.

      so I need a nice, fast inverted index program for ubuntu perl. knowing what I need, I could now search cpan. as luck would have it, Search::Moose seems to be designed for this sort of job. (as bad luck would have it, it aborts during the build stage on my ubuntu machine. if someone knows more about Search::Moose, please let me know.)

      thanks a lot, everybody.

      regards, /iaw

        I think the db descriptions are pretty much the same

        I guess we read different things.

        as luck would have it, Search::Moose seems to be designed for this sort of job.

        Hm. A search of CPAN for Search::Moose didn't turn up anything.

        And, if you're looking for "blindingly fast", anything with "Moose" in the title probably isn't going to cut it.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.