in reply to Re^2: Optimizing DBM::Deep file parameters
in thread Optimizing DBM::Deep file parameters
What that means is that if you sort your data and then put it in a hash, the sort buys you nothing. A btree, by contrast, is essentially a tree with the data kept in sorted order. (There are some details involved in keeping the tree balanced, but that's the basic idea.) Therefore if you start with a sorted data set and load it into a btree, you get great locality of reference, which means that a given section of data will be accessed, written, reaccessed, and rewritten many times in RAM before the operating system has to flush it to disk. That means building your data structure will be fast.
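To make the locality point concrete, here is a minimal sketch using DB_File's btree mode; the file name, key format, and record count are invented for the example:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DB_File;
    use Fcntl;

    # Open (or create) an on-disk btree via DB_File.
    my %tree;
    tie %tree, 'DB_File', 'sorted.db', O_RDWR | O_CREAT, 0666, $DB_BTREE
        or die "Cannot open sorted.db: $!";

    # The keys arrive in sorted order, so consecutive inserts land on the
    # same btree pages while those pages are still hot in RAM, and each
    # page gets flushed to disk only a few times instead of over and over.
    for my $n (1 .. 100_000) {
        my $key = sprintf "key%08d", $n;
        $tree{$key} = "value for $key";
    }

    untie %tree;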
Unfortunately DBM::Deep is internally a hash, not a btree, and there is no option to make it one. So you can't use this trick to speed up your load. And when you access your data later, you still have the same problem: the data you want is scattered all over the disk.
So unless you want to rewrite DBM::Deep to use a btree internally, or find a clever way to store your multi-level hash in a btree such as BerkeleyDB, there is no simple speedup of this kind to be had. Instead you'd need to rethink your entire multi-level hash data structure and rewrite that code.
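If you do go the BerkeleyDB/DB_File route, one possible approach is to flatten the multi-level hash into composite keys in a single btree. The sketch below assumes three levels and a NUL separator; the names and layout are made up for the example:

    use strict;
    use warnings;
    use DB_File;
    use Fcntl;

    my %flat;
    tie %flat, 'DB_File', 'flat.db', O_RDWR | O_CREAT, 0666, $DB_BTREE
        or die "Cannot open flat.db: $!";

    # Flatten $data->{$outer}{$middle}{$inner} into one composite key
    # "outer\0middle\0inner".  The btree keeps composite keys sorted, so
    # everything under the same outer key sits together on disk, and a
    # sorted bulk load gets the locality benefit described above.
    sub store_leaf {
        my ($outer, $middle, $inner, $value) = @_;
        $flat{ join("\0", $outer, $middle, $inner) } = $value;
    }

    sub fetch_leaf {
        my ($outer, $middle, $inner) = @_;
        return $flat{ join("\0", $outer, $middle, $inner) };
    }

    store_leaf('alpha', 'beta', 'gamma', 42);
    print fetch_leaf('alpha', 'beta', 'gamma'), "\n";   # prints 42

    untie %flat;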
Odds are fairly good that the right way to do it is to rewrite your multi-level hash as a series of tables in a relational database. Then let the database figure out the correct series of sorts and merges to implement whatever your logic is. Assuming the database comes up with a good plan (they don't always, unfortunately), it will take just a few hours to load and a few hours to query.
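To give a feel for what that looks like, here is a hedged sketch using DBI with DBD::SQLite; the table, column names, and the toy data are all inventions for the example, not anything from your code:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect("dbi:SQLite:dbname=data.db", "", "",
                           { RaiseError => 1, AutoCommit => 0 });

    # One row per leaf of the multi-level hash, with the hash levels as columns.
    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS leaf (
            level1 TEXT NOT NULL,
            level2 TEXT NOT NULL,
            level3 TEXT NOT NULL,
            value  TEXT
        )
    });

    # A tiny in-memory stand-in for the real multi-level hash.
    my %data = (
        alpha => { one => { x => 1, y => 2 } },
        beta  => { two => { z => 3 } },
    );

    my $sth = $dbh->prepare(
        "INSERT INTO leaf (level1, level2, level3, value) VALUES (?, ?, ?, ?)");

    for my $l1 (keys %data) {
        for my $l2 (keys %{ $data{$l1} }) {
            for my $l3 (keys %{ $data{$l1}{$l2} }) {
                $sth->execute($l1, $l2, $l3, $data{$l1}{$l2}{$l3});
            }
        }
    }
    $dbh->commit;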
But note that, for the exact same seek issues I have been talking about, you'll want to drop all indexes, load the data, and then rebuild the indexes. Loading a lot of data while trying to keep indexes up to date is slow, because it forces the database to do lots of seeks to disk.
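Continuing the same hypothetical SQLite sketch: the bulk insert above runs against a bare table, and the index gets built afterwards in one sequential pass over the already-loaded data:

    # If the index is left over from a previous run, drop it before reloading:
    # $dbh->do("DROP INDEX IF EXISTS idx_leaf_key");

    # Build the index only after the bulk load, so inserts never have to
    # update it row by row.
    $dbh->do("CREATE INDEX idx_leaf_key ON leaf (level1, level2, level3)");
    $dbh->commit;
    $dbh->disconnect;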