kahn_ray@yahoo.com has asked for the wisdom of the Perl Monks concerning the following question:

Folks, I have a very large text file (100 million records) that I will need to store as a db file. We will do lookup operations on the file only, but they have to be very fast. I understand BerkeleyDB is pretty good at this kind of operation. I have little experience with BerkeleyDB and wanted to find out whether you could offer some words of advice and wisdom. I greatly appreciate your feedback. -r

Replies are listed 'Best First'.
Re: BerkeleyDB and a very large file
by graff (Chancellor) on Aug 22, 2008 at 01:42 UTC
    The retrieval rate for data in Berkeley DB files is very quick, and does not depend in any noticeable way on the amount of data in a given DB file. If the basic hash-lookup strategy is adequate to your needs, it's a good way to go.

    But doing the initial creation of a really large Berkeley DB file may take a very long time for a data set in your range (100 million records), so have patience (and budget time) for that step.
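
    A minimal sketch of the tie interface, assuming one tab-separated key/value pair per input line (the filename and record layout are made up for illustration):

        use BerkeleyDB;

        # Create (or open) the hash-format DB file.
        my $db = tie my %h, 'BerkeleyDB::Hash',
            -Filename => 'lookup.db',
            -Flags    => DB_CREATE
            or die "Cannot open lookup.db: $BerkeleyDB::Error\n";

        # One-time build -- this is the slow part for 100 million records.
        while (my $line = <STDIN>) {
            chomp $line;
            my ($key, $value) = split /\t/, $line, 2;
            $h{$key} = $value;
        }

        # After that, a lookup is just a hash access.
        print $h{'some-key'}, "\n" if exists $h{'some-key'};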

Re: BerkeleyDB and a very large file
by Illuminatus (Curate) on Aug 21, 2008 at 22:59 UTC
    BerkeleyDB does put a few limitations on you:
    1) It's not SQL. Should you decide you need to change databases for some reason, migrating will take more effort.
    2) I am not sure if BerkeleyDB supports network access. As previously mentioned, network access is significantly slower, but it may be advantageous should you find a need to access the data from more than one system.
    3) If you have lots of memory, and your record sizes are not too large, you can store the entire table in RAM, which makes MySQL much faster than using an on-disk table (see the sketch below).
    If you need to access the database very heavily, and the above are not important to you, BerkeleyDB is probably the way to go.
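
    For point 3, a rough sketch of an in-RAM MySQL table via DBI (database, table, credentials and the 2 GB cap are all invented; MEMORY tables cannot hold TEXT/BLOB columns, and whether 100 million rows actually fit depends on row size and available RAM):

        use DBI;

        my $dbh = DBI->connect('DBI:mysql:database=lookups;host=localhost',
                               'someuser', 'somepass', { RaiseError => 1 });

        # The session max_heap_table_size caps how large a MEMORY table may grow.
        $dbh->do('SET SESSION max_heap_table_size = 2147483648');   # 2 GB, an assumption

        $dbh->do(q{
            CREATE TABLE lookup_mem (
                k VARCHAR(64)  NOT NULL,
                v VARCHAR(255) NOT NULL,
                PRIMARY KEY (k)
            ) ENGINE=MEMORY
        });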

      Another one, quite important: BerkeleyDB files are platform dependent, i.e. you cannot use a little-endian file on a big-endian platform and vice versa.

      On the other hand, BerkeleyDB doesn't provide network support; it's file-based only.

Re: BerkeleyDB and a very large file
by perrin (Chancellor) on Aug 21, 2008 at 21:20 UTC
    Well, there really is nothing faster than BerkeleyDB for doing key lookups on data that's too big to fit in RAM. Depending on your experience, you might be satisfied with just using a local MySQL. If you try MySQL, make sure you connect to it as localhost. It makes a difference in performance.
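
    For what it's worth, with DBD::mysql a host of localhost normally means the client talks to the server over the local Unix socket rather than TCP. A minimal sketch (database name, table and credentials are made up):

        use DBI;

        # 'localhost' (as opposed to 127.0.0.1) goes through the local socket.
        my $dbh = DBI->connect(
            'DBI:mysql:database=lookups;host=localhost',
            'someuser', 'somepass',
            { RaiseError => 1 },
        ) or die $DBI::errstr;

        my $sth = $dbh->prepare('SELECT v FROM lookup WHERE k = ?');
        $sth->execute('some-key');
        my ($value) = $sth->fetchrow_array;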

      And use MyISAM instead of InnoDB or another option (I think there are more now?). It doesn't support foreign keys or transactional features, but unless something has changed in the last couple of years it's still the fastest option for plain lookups, and it can be tuned to your specific data if you know how.
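
      If you do go the MyISAM route, the kind of tuning meant here looks roughly like this (all names and numbers are placeholders; MAX_ROWS and AVG_ROW_LENGTH just pre-size the table for ~100 million short rows, and the server's key_buffer_size is worth raising so the primary-key index stays in memory):

          use DBI;

          my $dbh = DBI->connect('DBI:mysql:database=lookups;host=localhost',
                                 'someuser', 'somepass', { RaiseError => 1 });

          $dbh->do(q{
              CREATE TABLE lookup (
                  k VARCHAR(64)  NOT NULL,
                  v VARCHAR(255) NOT NULL,
                  PRIMARY KEY (k)
              ) ENGINE=MyISAM
                MAX_ROWS=100000000
                AVG_ROW_LENGTH=100
          });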

        InnoDB is usually faster for me, but for single key unique lookups like this I expect MyISAM will be better.
Re: BerkeleyDB and a very large file
by BrowserUk (Patriarch) on Aug 23, 2008 at 10:28 UTC

    Since no one else has asked, I will: what does your data look like?

    That is,

    • how long are the keys?
    • how long are the associated records?

    A couple of typical examples would be good. Configuring the page size and cache size parameters to suit your data can have a strong influence upon performance.
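
    For example (the numbers below are placeholders, not recommendations -- they only make sense once you know your key and record sizes):

        use BerkeleyDB;

        my $db = tie my %h, 'BerkeleyDB::Hash',
            -Filename  => 'lookup.db',
            -Flags     => DB_CREATE,
            -Pagesize  => 8 * 1024,           # must be set when the file is first created
            -Cachesize => 512 * 1024 * 1024   # in-memory cache; use whatever RAM you can spare
            or die "tie failed: $BerkeleyDB::Error\n";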

    Also, how are you going to access (search) the records? That is, will your accesses be:

    • Totally random;
    • Initially random followed by sequential;
    • Always sequential;
    • Other

    Selecting the 'right' access method--btree, hash, queue, recno--to achieve best performance depends a lot upon how you access the data. Without knowledge of how your application will operate, advice is, at best, generic.
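
    For instance, if the pattern is "random seek, then read a run of neighbouring records", a btree plus a cursor fits well (the file name and starting key are hypothetical):

        use BerkeleyDB;

        my $db = tie my %h, 'BerkeleyDB::Btree',
            -Filename => 'lookup.db',
            -Flags    => DB_RDONLY
            or die "tie failed: $BerkeleyDB::Error\n";

        my ($k, $v) = ('some-starting-key', '');
        my $cursor = $db->db_cursor();

        # Jump to the first key >= $k, then walk forward in key order.
        if ($cursor->c_get($k, $v, DB_SET_RANGE) == 0) {
            do {
                print "$k => $v\n";
            } while ($cursor->c_get($k, $v, DB_NEXT) == 0);
        }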


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: BerkeleyDB and a very large file
by tilly (Archbishop) on Aug 23, 2008 at 00:07 UTC
    Advice on your initial load. Loading that file naively means doing 100 million writes. Each write means seeking to read, reading, then seeking to write and writing. So that's 200 million seeks on disk. If your disk spins at 6000 RPM, it spins 100 times per second, and seek time averages out to the time for it to spin halfway. So you do about 200 seeks per second. That takes a million seconds, which is about a week and a half.

    Suppose each record takes 100 bytes. Then you have about 10 GB of data. Sorting with a merge-sort should take about 30 passes. Each pass involves reading and writing all of your data. So that's about 600 GB of data. Supposing that your disk has a sustained throughput of 60 MB/s, that should take 10,000 seconds, or about 3 hours. Plus CPU time. Minus your savings if your sort implementation is smart enough to do some passes in RAM to minimize writing to disk. But any way you cut it, far faster.
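
    The same arithmetic as a back-of-an-envelope script (every number here is just the assumption stated above):

        my $records       = 100_000_000;
        my $seeks_per_sec = 200;                        # 6000 RPM, half a rotation per seek
        my $naive         = 2 * $records / $seeks_per_sec;   # two seeks per record
        printf "naive load:      ~%.1f days\n",  $naive / 86_400;    # ~11.6 days

        my $bytes  = $records * 100;                    # 100 bytes per record, ~10 GB
        my $io     = 30 * 2 * $bytes;                   # 30 passes, read + write, ~600 GB
        my $sorted = $io / 60_000_000;                  # 60 MB/s sustained throughput
        printf "sort-based load: ~%.1f hours\n", $sorted / 3_600;    # ~2.8 hours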

    Therefore my standard suggestion for this kind of data volume is to use the Unix sort utility to sort your dataset, then make a BerkeleyDB btree and load that. (Btrees are stored in sorted order, so loading already-sorted data turns the random seeks into mostly sequential I/O.) This will make your initial data load take just a few hours rather than a week and a half.
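
    A sketch of that workflow, assuming tab-separated "key<TAB>value" lines (the filenames are made up):

        # First, outside Perl:
        #   LC_ALL=C sort -t "$(printf '\t')" -k1,1 records.txt > records.sorted

        use BerkeleyDB;

        my $db = BerkeleyDB::Btree->new(
            -Filename => 'lookup.db',
            -Flags    => DB_CREATE,
        ) or die "Cannot create lookup.db: $BerkeleyDB::Error\n";

        open my $in, '<', 'records.sorted' or die "records.sorted: $!";
        while (my $line = <$in>) {
            chomp $line;
            my ($key, $value) = split /\t/, $line, 2;
            $db->db_put($key, $value);   # keys arrive pre-sorted, so pages fill sequentially
        }
        close $in;
        $db->db_close();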

    As a bonus, for large datasets if there is any locality of reference in your requests (usually there is), a btree makes better use of cached data in RAM than a hash can.