in reply to Re^2: Optimizing DBM::Deep file parameters
in thread Optimizing DBM::Deep file parameters
What that means is that if you sort your data and then put it in a hash, the sort buys you nothing. A btree, by contrast, is essentially a tree with the data kept in sorted order. (There are some details involved in keeping the tree balanced, but that's the basic idea.) Therefore if you start with a sorted data set and load it into a btree, you get great locality of reference, which means that a given section of data will be accessed, written, reaccessed, and rewritten many times in RAM before the operating system has to flush it to disk. That means building your data structure will be fast.
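To make the locality point concrete, here is a minimal sketch using DB_File's btree mode; the file name, key format, and record count are invented for the example:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DB_File;
    use Fcntl;

    # Open (or create) an on-disk btree via DB_File.
    my %tree;
    tie %tree, 'DB_File', 'sorted.db', O_RDWR | O_CREAT, 0666, $DB_BTREE
        or die "Cannot open sorted.db: $!";

    # The keys arrive in sorted order, so consecutive inserts land on the
    # same btree pages while those pages are still hot in RAM, and each
    # page gets flushed to disk only a few times instead of over and over.
    for my $n (1 .. 100_000) {
        my $key = sprintf "key%08d", $n;
        $tree{$key} = "value for $key";
    }

    untie %tree;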
Unfortunately DBM::Deep is internally a hash, not a btree, and there is no option to make it one. So you can't use this trick to speed up your load. And when you access your data later, you still have the same problem: the data you want is scattered all over the disk.
So unless you want to rewrite DBM::Deep to use a btree internally, or find a clever way to store your multi-level hash in a btree such as BerkeleyDB, there is no simple speedup of this kind to be had. Instead you'd need to rethink your entire multi-level hash data structure and rewrite that code.
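If you do go the BerkeleyDB/DB_File route, one possible approach is to flatten the multi-level hash into composite keys in a single btree. The sketch below assumes three levels and a NUL separator; the names and layout are made up for the example:

    use strict;
    use warnings;
    use DB_File;
    use Fcntl;

    my %flat;
    tie %flat, 'DB_File', 'flat.db', O_RDWR | O_CREAT, 0666, $DB_BTREE
        or die "Cannot open flat.db: $!";

    # Flatten $data->{$outer}{$middle}{$inner} into one composite key
    # "outer\0middle\0inner".  The btree keeps composite keys sorted, so
    # everything under the same outer key sits together on disk, and a
    # sorted bulk load gets the locality benefit described above.
    sub store_leaf {
        my ($outer, $middle, $inner, $value) = @_;
        $flat{ join("\0", $outer, $middle, $inner) } = $value;
    }

    sub fetch_leaf {
        my ($outer, $middle, $inner) = @_;
        return $flat{ join("\0", $outer, $middle, $inner) };
    }

    store_leaf('alpha', 'beta', 'gamma', 42);
    print fetch_leaf('alpha', 'beta', 'gamma'), "\n";   # prints 42

    untie %flat;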
Odds are fairly good that the right way to do it is to rewrite your multi-level hash as a series of tables in a relational database. Then let the database figure out the correct series of sorts and merges to implement whatever your logic is. Assuming the database comes up with a good plan (they don't always, unfortunately), it will take just a few hours to load and a few hours to query.
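To give a feel for what that looks like, here is a hedged sketch using DBI with DBD::SQLite; the table, column names, and the toy data are all inventions for the example, not anything from your code:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect("dbi:SQLite:dbname=data.db", "", "",
                           { RaiseError => 1, AutoCommit => 0 });

    # One row per leaf of the multi-level hash, with the hash levels as columns.
    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS leaf (
            level1 TEXT NOT NULL,
            level2 TEXT NOT NULL,
            level3 TEXT NOT NULL,
            value  TEXT
        )
    });

    # A tiny in-memory stand-in for the real multi-level hash.
    my %data = (
        alpha => { one => { x => 1, y => 2 } },
        beta  => { two => { z => 3 } },
    );

    my $sth = $dbh->prepare(
        "INSERT INTO leaf (level1, level2, level3, value) VALUES (?, ?, ?, ?)");

    for my $l1 (keys %data) {
        for my $l2 (keys %{ $data{$l1} }) {
            for my $l3 (keys %{ $data{$l1}{$l2} }) {
                $sth->execute($l1, $l2, $l3, $data{$l1}{$l2}{$l3});
            }
        }
    }
    $dbh->commit;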
But note that, for the exact same seek issues I have been talking about, you'll want to drop all indexes, load the data, and then rebuild the indexes. Loading a lot of data while trying to keep indexes up to date is slow, because it forces the database to do lots of seeks to disk.
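Continuing the same hypothetical SQLite sketch: the bulk insert above runs against a bare table, and the index gets built afterwards in one sequential pass over the already-loaded data:

    # If the index is left over from a previous run, drop it before reloading:
    # $dbh->do("DROP INDEX IF EXISTS idx_leaf_key");

    # Build the index only after the bulk load, so inserts never have to
    # update it row by row.
    $dbh->do("CREATE INDEX idx_leaf_key ON leaf (level1, level2, level3)");
    $dbh->commit;
    $dbh->disconnect;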