in reply to BerkeleyDB and a very large file
Suppose each record takes 100 bytes. Then you have about 10 GB of data. Sorting with a plain 2-way merge sort should take about 30 passes (roughly log2 of the record count, and log2 of 100 million is about 27). Each pass involves reading and writing all of your data, so that's about 600 GB of I/O. Supposing that your disk has a sustained throughput of 60 MB/s, that should take 10,000 seconds, or about 3 hours. Plus CPU time. Minus your savings if your sort implementation is smart enough to do some passes in RAM to minimize writing to disk. But any way you cut it, far faster than the load you're doing now.
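Here is that back-of-envelope arithmetic as a small sketch; the record count, record size, and disk throughput are the assumed figures from above, not measured values:

    # Rough cost estimate for an external 2-way merge sort.
    # All inputs are the assumptions from the text above.
    import math

    record_count = 100_000_000   # assumed: about 100 million records
    record_size = 100            # assumed: bytes per record
    throughput = 60e6            # assumed: 60 MB/s sustained disk throughput

    data_bytes = record_count * record_size        # about 10 GB
    passes = math.ceil(math.log2(record_count))    # about 27, call it 30
    io_bytes = 2 * passes * data_bytes             # read + write on every pass
    seconds = io_bytes / throughput

    print(f"{passes} passes, {io_bytes / 1e9:.0f} GB of I/O, "
          f"~{seconds / 3600:.1f} hours at sustained throughput")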
Therefore my standard suggestion for this kind of data volume is to use the Unix sort utility to sort your dataset, then make a BerkeleyDB btree and load that. (Btrees are stored in sorted order, so loading pre-sorted data is mostly sequential I/O, and you get the disk's sustained throughput instead of a seek per insert.) This will make your initial data load take just a few hours rather than a week and a half.
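A minimal sketch of that load step, assuming the data is a tab-delimited key/value text file and that the bsddb3 Python bindings to BerkeleyDB are available; the filenames and field layout are made up for illustration:

    # Sort on the command line first, for example:
    #   sort -t $'\t' -k1,1 records.txt > records.sorted.txt
    from bsddb3 import db   # assumed: bsddb3 bindings installed

    btree = db.DB()
    btree.open("records.db", dbtype=db.DB_BTREE, flags=db.DB_CREATE)

    with open("records.sorted.txt", "rb") as f:
        for line in f:
            key, value = line.rstrip(b"\n").split(b"\t", 1)
            # Keys arrive in sorted order, so inserts go to the end of the
            # btree rather than seeking randomly around the file.
            btree.put(key, value)

    btree.close()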
As a bonus, for large datasets, if there is any locality of reference in your requests (there usually is), a btree makes better use of cached data in RAM than a hash can.