Sorting the data before writing to the hash file will not help unless you can make the file internally a btree rather than a hash. The reason is that hashes work by mapping pieces of data to apparently random buckets. When you fetch or insert data, the target bucket should have very few entries, which is what makes the average access time of a hash O(1) in the size of the hash. Unfortunately, when a hash is laid out on disk, that same mapping means each value lands at an effectively random location in the file.
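
To make that concrete, here is a toy sketch (this is not DBM::Deep's actual hash function; MD5 modulo an arbitrary 1024 buckets just stands in for the idea) showing how even perfectly sorted keys scatter:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Digest::MD5 qw(md5);

    # Even sorted keys land in apparently random buckets once hashed.
    my $buckets = 1024;
    for my $key ('aaa' .. 'aaj') {
        # Treat the first 4 bytes of the MD5 digest as an unsigned
        # integer, then reduce it modulo the bucket count.
        my $n = unpack('N', md5($key)) % $buckets;
        printf "%s -> bucket %4d\n", $key, $n;
    }

Adjacent keys like 'aaa' and 'aab' print wildly different bucket numbers, and in a disk-based hash those bucket numbers translate into wildly different file offsets.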

What that means is that if you sort your data and then put it into a hash, the sort is useless. A btree, by contrast, is essentially a tree with the data kept in sorted order. (There are some details involved in keeping the tree balanced, but it basically looks like a tree.) Therefore, if you start with a sorted data set and load it into a btree, you get excellent locality of reference: a given section of data is accessed, written, reaccessed, and rewritten many times in RAM before the operating system has to flush it to disk. That means building your data structure will be fast.
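
To see the contrast in actual code, here is a sketch using DB_File's $DB_BTREE mode (the filename and sample data are made up for illustration). Because the keys arrive in sorted order, consecutive inserts keep hitting the same few pages:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DB_File;
    use Fcntl;

    # Hypothetical source data, sorted before insertion.
    my %source = map { sprintf('key%06d', $_) => "value $_" } 1 .. 10_000;

    tie my %tree, 'DB_File', 'data.btree', O_RDWR|O_CREAT, 0666, $DB_BTREE
        or die "Cannot open data.btree: $!";

    # Sorted insertion gives good locality: each btree page fills up
    # in RAM before the operating system has to flush it to disk.
    for my $key (sort keys %source) {
        $tree{$key} = $source{$key};
    }
    untie %tree;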

Unfortunately DBM::Deep is internally a hash, not a btree, and there is no option to make it a btree. So you can't use this trick to speed up your load. And when you later access your data, you still have the same problem: you're reading data that is scattered all over the disk.

So unless you want to rewrite DBM::Deep to be a btree internally, or find a clever way to store your multi-level hash in a btree such as BerkeleyDB, there is no simple speedup of this kind to be had. Instead you'd need to rethink your entire multi-level hash data structure and rewrite that code.
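
If you did go the BerkeleyDB route, one possible trick (sketched here with made-up data; the "\0" separator is an assumption that only holds if your keys never contain NUL bytes) is to flatten the multi-level keys into single composite keys, so that all entries sharing an outer key sit next to each other in the btree:

    use strict;
    use warnings;
    use BerkeleyDB;

    # Hypothetical two-level hash to be flattened.
    my %h = (
        fruit => { apple => 3, pear => 7 },
        meat  => { beef  => 2 },
    );

    my $db = BerkeleyDB::Btree->new(
        -Filename => 'multi.db',
        -Flags    => DB_CREATE,
    ) or die "Cannot open multi.db: $BerkeleyDB::Error";

    # Flatten $h{$outer}{$inner} = $v into one sorted composite key.
    for my $outer (sort keys %h) {
        for my $inner (sort keys %{ $h{$outer} }) {
            $db->db_put("$outer\0$inner", $h{$outer}{$inner});
        }
    }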

Odds are fairly good that the right way to do it is to rewrite your multi-level hash as a series of tables in a relational database, then let the database figure out the correct series of sorts and merges to implement whatever your logic is. Assuming the database comes up with a good plan (they don't always, unfortunately), it will take just a few hours to load and a few hours to query.
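
As a rough sketch of what that might look like (the table and column names are hypothetical, and SQLite via DBI stands in for whatever database you would actually use), each leaf of the multi-level hash becomes one row, and the query planner does the sort/merge work for you:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=data.db', '', '',
                           { RaiseError => 1, AutoCommit => 0 });

    # One row per leaf of the former multi-level hash.
    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS counts (
            outer_key TEXT    NOT NULL,
            inner_key TEXT    NOT NULL,
            value     INTEGER NOT NULL
        )
    });

    my $sth = $dbh->prepare('INSERT INTO counts VALUES (?, ?, ?)');
    $sth->execute('fruit', 'apple', 3);   # stand-in for the real load loop
    $dbh->commit;

    # What used to be a walk of the nested hash becomes one query.
    my $totals = $dbh->selectall_arrayref(
        'SELECT outer_key, SUM(value) FROM counts GROUP BY outer_key'
    );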

But note that, for the exact same seek issues I have been talking about, you'll want to drop all indexes, load the data, and then rebuild the indexes. Trying to load a lot of data while keeping indexes up to date is slow, because it forces the database to do lots of seeks to disk.
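
Continuing the hypothetical SQLite sketch above, the load-then-index pattern looks like this (the index name is made up; on a fresh load you would simply not create the index until the data is in):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=data.db', '', '',
                           { RaiseError => 1, AutoCommit => 1 });

    # Make sure no index exists while the bulk load runs.
    $dbh->do('DROP INDEX IF EXISTS idx_counts_outer');

    # ... bulk load here, committing in large batches ...

    # One sequential pass builds the index after all the data is in place.
    $dbh->do('CREATE INDEX idx_counts_outer ON counts (outer_key, inner_key)');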


In reply to Re^3: Optimizing DBM::Deep file parameters by tilly
in thread Optimizing DBM::Deep file parameters by QM
