Re: fast disk db with bulk insert, fast read access, compact storage
by GrandFather (Saint) on Sep 15, 2010 at 05:40 UTC
Maybe you should tell us a little more about your actual application before we get too distracted by "I can't do X because Y" assertions that come without any justification. Interesting additional information may be:
- How many read accesses per (whatever appropriate period of time)?
- Will the same 'word' be accessed multiple times per period?
- Is the same root, comprising the first n characters, likely to be accessed multiple times per period?
- What is an appropriate period?
- What are the constraints on the time it takes to build the 'database'?
- What is 'very fast'?
- Is there any point compressing the values?
- What else might influence a smart solution?
True laziness is hard work
Thanks, everyone, for the pointed questions. I had tried to keep my question reasonably general, because I thought it would be more useful to others who have similar questions.
The specific application is a database of published articles. Think of
unique-key|Time Magazine|Why the monks are great|Sep 13, 2006|p245-133|volume 8|number 10
The database, in plain text and this form, is about 5GB now (but could grow to 20GB in the future), so the ASCII version fits into RAM. Usually the DB changes, say, once per month. I could rebuild it anew from the store every time. There is no guarantee on length or uniqueness of anything except the unique key.
I do need quick access to individual words. So, if I want to find all articles that contain the word 'Time', the word 'monks', and the number 245, my search should be blindingly fast at finding all unique keys whose records contain those three words, and then display those records. Assume access is very frequent, too; say I wanted to do 'permutation of words' research, so each article launches a search over the database.
The lazy implementation would be to take every word and put it as a key into a hash whose value is the array of unique keys where the word occurs, plus a second hash that gives me the record for a given unique key. Of course, with Perl hashes, this would take too much space. From my limited experience with SQL, after I rearrange the data it would also blow up a lot.
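For what it's worth, a minimal sketch of that lazy approach in pure Perl (the field layout is taken from the example record above; the sample data and the crude word/number tokenizer are my own assumptions):

    use strict;
    use warnings;

    my (%index, %record);   # word => [unique keys], unique key => full record

    while (my $line = <DATA>) {
        chomp $line;
        my ($key) = split /\|/, $line;           # unique key is the first field
        $record{$key} = $line;
        my %seen;                                # index each word once per record
        for my $word (lc($line) =~ /[a-z]+|\d+/g) {
            push @{ $index{$word} }, $key unless $seen{$word}++;
        }
    }

    # AND-query: unique keys whose records contain every given word
    # (assumes the query words are distinct)
    sub search_all {
        my @words = map lc, @_;
        my %hits;
        $hits{$_}++ for map { @{ $index{$_} || [] } } @words;
        return grep { $hits{$_} == @words } keys %hits;
    }

    print "$record{$_}\n" for search_all('Time', 'monks', '245');

    __DATA__
    key1|Time Magazine|Why the monks are great|Sep 13, 2006|p245-133|volume 8|number 10
    key2|Other Weekly|Nothing in common here|Jan 01, 2001|p1-10|volume 1|number 1

It works, but as noted, the per-key overhead of Perl hashes makes this expensive at 5GB scale.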
On the plus side, this is all "read-only".
SQL would be OK, but it just doesn't feel like the right tool for the job. SQL DBs seem made more for updating than for blindingly fast read access.
I was also only guessing that an SSD would be a good tool for the job.
Help?
Your original description of your application:
simple---key, value. ... 32GB ... (think as application of a word hash that I am rebuilding every night, and I want to do real-time search as my users are typing words.)
Is almost completely at odds with this description:
is about 5GB now (but could grow to 20GB in the future) ... the DB changes, say, once per month ... I do need quick access to individual words. So, if I want to find all articles that contain the word 'Time', the word 'monks', and the number 245, my search should be blindingly fast at finding all unique keys whose records contain those three words, and then display those records.
The former implies indexing by the characters of the unique key only.
The latter requires a fully inverted index of the words in the entire records, which essentially makes the unique key redundant.
You need to define the actual use your data will be put to before looking for the mechanism for doing it.
Re: fast disk db with bulk insert, fast read access, compact storage
by mr_mischief (Monsignor) on Sep 15, 2010 at 06:43 UTC
Normally, when I hear "I need to optimize for fast reads with few inserts and can't use SQL", my first two thoughts are "why can't you use SQL?" and "LDAP or some other directory service is great for lots of reads and only a few inserts". However, I'm having trouble understanding your need to fully rebuild the database every night. Is that really what you meant to say? Will your users and the data you search really change that much every day?
my first two thoughts are "why can't you use SQL?"
True, or to extend the thought a bit: "why not an RDBMS?"
For the bulk inserts, just use a HEAP-structured table, which has no keys and so is the preferred structure for bulk inserts.
Are your keys unique? Then create a primary index on the key, in whatever structure suits your requirements. You say I need very fast read access; a HASH index, for example, would make exact key lookups very fast, though at the same time it is not suitable for pattern matching. A rough sketch of this load-then-index approach is below.
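A rough sketch with DBI (the database, table, and column names are invented; the hash-index syntax shown is PostgreSQL's):

    use strict;
    use warnings;
    use DBI;

    # Connection details are hypothetical
    my $dbh = DBI->connect('dbi:Pg:dbname=articles', 'user', 'pass',
                           { RaiseError => 1, AutoCommit => 0 });

    # Plain heap table with no keys: the cheapest structure to bulk-load
    $dbh->do('CREATE TABLE article (ukey TEXT, record TEXT)');

    my @rows = (
        [ 'key1', 'Time Magazine|Why the monks are great|...' ],
        # ... the rest of the monthly dump ...
    );
    my $sth = $dbh->prepare('INSERT INTO article (ukey, record) VALUES (?, ?)');
    $sth->execute(@$_) for @rows;
    $dbh->commit;

    # Index only after the load; a hash index gives fast exact-key lookups
    # but, as noted, is no use for prefix or pattern matching.
    $dbh->do('CREATE INDEX article_ukey_hash ON article USING hash (ukey)');
    $dbh->commit;

    $dbh->disconnect;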
Re: fast disk db with bulk insert, fast read access, compact storage
by Utilitarian (Vicar) on Sep 15, 2010 at 07:59 UTC
BerkeleyDB seems to meet your specification almost perfectly. It's not a straightforward DB by any means, but check out BerkeleyDB: very fast key/value pairs, db_dump and db_load for bulk loading, full ACID transactions, etc. However, you would have to host the DB environment on the same server; it doesn't support network access, as it's an embedded database. The hash structure would suit your needs (a minimal example is below). It does fail on the compact-storage aspect, since a DB hash is larger than the original data, but that's the trade-off for fast access.
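A minimal sketch using the CPAN BerkeleyDB module (the file name and data are invented):

    use strict;
    use warnings;
    use BerkeleyDB;

    # Tie a Perl hash to an on-disk Berkeley DB hash database
    tie my %db, 'BerkeleyDB::Hash',
        -Filename => 'articles.db',
        -Flags    => DB_CREATE
        or die "Cannot open articles.db: $BerkeleyDB::Error";

    $db{key1} = 'Time Magazine|Why the monks are great|Sep 13, 2006|...';
    print "$db{key1}\n";      # reads go straight to the on-disk hash

    untie %db;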
As suggested previously an LDAP server (Net::LDAP) might also suit your needs.
print "Good ",qw(night morning afternoon evening)[(localtime)[2]/6]," fellow monks."
Re: fast disk db with bulk insert, fast read access, compact storage
by roboticus (Chancellor) on Sep 15, 2010 at 04:56 UTC
Re: fast disk db with bulk insert, fast read access, compact storage
by BrowserUk (Patriarch) on Sep 15, 2010 at 02:06 UTC
Are the keys actual words? Just alpha characters? Do you have reasonable estimates for the maximum and minimum lengths?
Re: fast disk db with bulk insert, fast read access, compact storage
by zentara (Cardinal) on Sep 15, 2010 at 09:42 UTC
This cache comparison is interesting. It says plain old Storable has the fastest freeze/thaw, but BerkeleyDB should be used for long-term storage.
Re: fast disk db with bulk insert, fast read access, compact storage
by Marshall (Canon) on Sep 15, 2010 at 14:11 UTC
I would like to better understand this: I can get myself an 80GB SSD
You can get a nice 1TB hard drive for ~$80, so why this restriction? The indexing that a DB does takes space, but it cuts down on the "seeking". Using more storage space to achieve higher overall performance is a common trade-off: using 10x the storage may be faster overall, even if access to one particular "hunk" of data is slower than on an SSD.
If the problem described were fetching a single record by key, you might just be right.
But read between the OP's lines a little and you can quite imagine that he is trying to implement something like Google's new auto-complete search.
That means that each time the user types a keystroke, he has to:
- Re-query the index for a list of record numbers whose keys start with the keystrokes accumulated so far.
- Then read (say) 10 records matching that prefix and present them to the user.
And repeat that for each new keystroke. (A toy illustration of the loop follows.)
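A toy sketch of that per-keystroke loop against a tiny in-memory index (all data invented; a real index would binary-search a sorted key file rather than grep a list):

    use strict;
    use warnings;

    my @keys = sort qw(monastery monk monkey monks month);   # toy sorted index

    # Return up to $limit keys that start with $prefix
    sub prefix_matches {
        my ($prefix, $limit) = @_;
        my @hits = grep { index($_, $prefix) == 0 } @keys;   # linear scan for clarity
        splice @hits, $limit if @hits > $limit;
        return @hits;
    }

    # Simulate the accumulated keystrokes 'm', 'mo', 'mon', 'monk'
    my $typed = '';
    for my $ch (split //, 'monk') {
        $typed .= $ch;
        printf "%-4s -> %s\n", $typed, join(', ', prefix_matches($typed, 10));
    }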
Under that scenario, the SSD will be invaluable.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Well, actually, I don't know exactly what the OP is trying to do, and I think that GrandFather's questions are on point.
What the OP was describing is reminiscent of a telephony feature in the US called "dial by name". A US phone has letters in addition to the touch-pad numbers: number 2 has ABC, number 3 has DEF, etc. For example, BrowserUk is 276973785, assuming that I did that translation right! In this case 7 can mean R or S. Anyway, the way this works for the user is that he/she starts typing letters, and the system figures out, according to a set of heuristics, what to say to the user: e.g., are you close enough that I should give you, say, 3 options, or should I keep my mouth shut and let you keep dialing, or should I say something to get you to keep dialing, etc. (A toy sketch of the letter-to-digit mapping follows.)
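A toy Perl sketch of that letter-to-digit translation (the keypad letter groups are the standard US layout; everything else is invented):

    use strict;
    use warnings;

    # Build a letter => digit map from the standard keypad groups for 2..9
    my %pad;
    my @groups = qw(abc def ghi jkl mno pqrs tuv wxyz);
    for my $i (0 .. $#groups) {
        $pad{$_} = $i + 2 for split //, $groups[$i];
    }

    sub name_to_digits {
        my ($name) = @_;
        return join '', map { $pad{$_} // '' } split //, lc $name;
    }

    print name_to_digits('BrowserUk'), "\n";   # prints 276973785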
Anyway, knowing what all of the entries in the directory are, and being able to spend some CPU MIPS organizing them into an efficient data structure that the application can use, is very helpful, to say the least. More memory helps. This fixed-vocabulary part will help a lot: minor additions can be done on the fly, but that "rebuild every night" part will help a whole lot.
At this point, I don't know enough to say "hey, I recommend doing X". I'm just at the point of asking "why do you think Z is a requirement/limitation?"