in reply to BerkeleyDB and a very large file
Suppose each record takes 100 bytes. Then you have about 10 GB of data. Sorting with a plain 2-way merge sort should take about 30 passes (roughly log2 of the record count, and log2 of 100 million is about 27). Each pass involves reading and writing all of your data, so that's about 600 GB of I/O. Supposing that your disk has a sustained throughput of 60 MB/s, that should take 10,000 seconds, or about 3 hours. Plus CPU time. Minus your savings if your sort implementation is smart enough to do some passes in RAM to minimize writing to disk. But any way you cut it, far faster than the load you're doing now.
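Here is that back-of-envelope arithmetic as a small sketch; the record count, record size, and disk throughput are the assumed figures from above, not measured values:

    # Rough cost estimate for an external 2-way merge sort.
    # All inputs are the assumptions from the text above.
    import math

    record_count = 100_000_000   # assumed: about 100 million records
    record_size = 100            # assumed: bytes per record
    throughput = 60e6            # assumed: 60 MB/s sustained disk throughput

    data_bytes = record_count * record_size        # about 10 GB
    passes = math.ceil(math.log2(record_count))    # about 27, call it 30
    io_bytes = 2 * passes * data_bytes             # read + write on every pass
    seconds = io_bytes / throughput

    print(f"{passes} passes, {io_bytes / 1e9:.0f} GB of I/O, "
          f"~{seconds / 3600:.1f} hours at sustained throughput")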
Therefore my standard suggestion for this kind of data volume is to use the Unix sort utility to sort your dataset, then make a BerkeleyDB btree and load that. (Btrees are stored in sorted order, so loading pre-sorted data is mostly sequential I/O, and you get the disk's sustained throughput instead of a seek per insert.) This will make your initial data load take just a few hours rather than a week and a half.
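A minimal sketch of that load step, assuming the data is a tab-delimited key/value text file and that the bsddb3 Python bindings to BerkeleyDB are available; the filenames and field layout are made up for illustration:

    # Sort on the command line first, for example:
    #   sort -t $'\t' -k1,1 records.txt > records.sorted.txt
    from bsddb3 import db   # assumed: bsddb3 bindings installed

    btree = db.DB()
    btree.open("records.db", dbtype=db.DB_BTREE, flags=db.DB_CREATE)

    with open("records.sorted.txt", "rb") as f:
        for line in f:
            key, value = line.rstrip(b"\n").split(b"\t", 1)
            # Keys arrive in sorted order, so inserts go to the end of the
            # btree rather than seeking randomly around the file.
            btree.put(key, value)

    btree.close()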
As a bonus, for large datasets, if there is any locality of reference in your requests (there usually is), a btree makes better use of cached data in RAM than a hash can.