in reply to Optimizing DBM::Deep file parameters

I'm sorry, but you aren't going to get a big performance improvement without rewriting your code. And that isn't the fault of DBM::Deep. If you want to make things fast, you should rewrite your code so that you sort your data first and then process it in sorted order. That should be at least an order of magnitude faster than what you're doing now.

The physical reason is the same as the one that I described in Re: BerkeleyDB and a very large file. Seeks to disk are slow. When you're writing records in random order to DBM::Deep, you're forced to do a lot of seeks. But sorting can be done by streaming data to and from disk. And disks are much better at that than at random access.
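
For illustration, here is a rough sketch of the sort-then-stream pattern I mean. The file names are made up, get_next_record() and process_record() are stand-ins for whatever your code actually does, and I'm shelling out to the Unix sort(1) for the sort itself:

use strict;
use warnings;

# Phase 1: stream the raw records out as "key<TAB>value" lines.
open my $out, '>', 'unsorted.txt' or die "open: $!";
while ( my ($key, $value) = get_next_record() ) {   # stand-in for your parser
    print {$out} "$key\t$value\n";
}
close $out or die "close: $!";

# Phase 2: let sort(1) do the work -- it streams data to and from disk
# instead of seeking all over the place.
system('sort', '-o', 'sorted.txt', 'unsorted.txt') == 0
    or die "sort failed: $?";

# Phase 3: process the records in sorted order in one sequential pass.
open my $in, '<', 'sorted.txt' or die "open: $!";
while ( my $line = <$in> ) {
    chomp $line;
    my ($key, $value) = split /\t/, $line, 2;
    process_record($key, $value);                   # stand-in for your logic
}
close $in;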

Re^2: Optimizing DBM::Deep file parameters
by QM (Parson) on Aug 29, 2008 at 20:49 UTC
    Good points.

    To some degree, I think I can sort the data before writing to the hash/file. However, there are several points that will hinder that:

    1) The hash is multilevel. 99% of the data inserts would look like the first line below:

    $hash{DATA}{k1}{k2}{k3}{k4}{k5}     = value1;
    $hash{DATA}{k1}{k2}{k3}{k4}{h1}{h2} = value2;
    $hash{HEADER1}{h1}{h2}              = value3;
    $hash{HEADER2}{j1}                  = value4;
    Would this be a hurdle?

    2) In some cases it's possible to know a lot about the contents of an input file based on its name and path, but not in all cases. The input data files could be sorted according to their apparent contents.

    Would it then make sense to use a %temp_hash for each input file, and then import it into the %tied_hash at the end of each file?
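
    Something like this sketch is what I have in mind (get_records_from() is a stand-in for the real parser, and the DBM::Deep file name is made up):

    use strict;
    use warnings;
    use DBM::Deep;

    my $db = DBM::Deep->new( 'big_data.db' );       # made-up file name
    my @input_files = @ARGV;

    for my $file (@input_files) {
        my %temp;                                   # plain in-memory hash, one per file

        for my $rec ( get_records_from($file) ) {   # stand-in for the real parser
            my ($k1, $k2, $k3, $k4, $k5, $value) = @$rec;
            $temp{DATA}{$k1}{$k2}{$k3}{$k4}{$k5} = $value;
        }

        # At end-of-file, push the accumulated data into the tied store.
        # (If the same $k1 can show up in more than one file, this would
        # need to merge at a deeper level instead of assigning whole subtrees.)
        next unless $temp{DATA};
        while ( my ($k1, $subtree) = each %{ $temp{DATA} } ) {
            $db->{DATA}{$k1} = $subtree;
        }
    }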

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

      Sorting the data before writing to the hash file will not help unless you can make the file internally a btree rather than a hash. The reason is that hashes work by mapping pieces of data to apparently random buckets. When you go to fetch or insert data, that bucket should have very few entries. Which is what makes the average access time of a hash O(1), independent of the size of the hash. Unfortunately, when a hash is laid out on disk, that means that each value goes to a random location on disk.
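
      You can watch the scattering happen with a toy bucket function. This just mimics what a hash implementation does internally -- the real function inside DBM::Deep differs, and the 1024 bucket count is arbitrary:

      use strict;
      use warnings;
      use Digest::MD5 qw(md5);

      my $buckets = 1024;                           # arbitrary bucket count
      for my $key ( map { sprintf "key%05d", $_ } 1 .. 10 ) {
          # Use 32 bits of a digest as a stand-in for the real hash function.
          my $bucket = unpack( "N", md5($key) ) % $buckets;
          print "$key -> bucket $bucket\n";
      }
      # Consecutive (sorted) keys land in unrelated buckets, which on disk
      # means unrelated locations, so feeding it sorted input buys nothing.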

      What that means is that if you sort your data and then put it in a hash, the sort is useless. A btree, by contrast, is essentially a tree with the data kept in sorted order. (There are some details to make sure that the tree always stays balanced, but it is basically a sorted tree.) Therefore if you start with a sorted data set and put it into a btree, you get great locality of reference, which means that a given section of data will be accessed, written, reaccessed, and rewritten many times in RAM before the operating system has to flush it to disk. Which means that building your data structure will be fast.
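
      To make the btree side of this concrete: core Perl ships DB_File, which can tie a hash to an on-disk btree. This is only a sketch (the file name is invented, and it assumes pre-sorted "key<TAB>value" lines on STDIN), not what you asked about, but it shows the locality win:

      use strict;
      use warnings;
      use DB_File;
      use Fcntl qw(O_RDWR O_CREAT);

      # Tie a hash to an on-disk B-tree instead of an on-disk hash.
      tie my %btree, 'DB_File', 'data.btree', O_RDWR | O_CREAT, 0666, $DB_BTREE
          or die "Cannot open btree file: $!";

      # Because keys arrive in sorted order, successive writes keep hitting
      # the same few btree pages in RAM before anything has to be flushed.
      while ( my $line = <STDIN> ) {
          chomp $line;
          my ($key, $value) = split /\t/, $line, 2;
          $btree{$key} = $value;
      }

      untie %btree;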

      Unfortunately DBM::Deep is internally a hash, not a btree, and there is no option to make it a btree. So you can't use this trick to speed up your load. Plus when you access your data afterwards, you still have the same problem: you're accessing data that is scattered all over the disk.

      So unless you want to rewrite DBM::Deep to be a btree internally, or find a clever way to store your multi-level hash in a btree like BerkeleyDB, there is no simple speedup of this sort to be had. Instead you'd need to rethink your entire multi-level hash data structure and rewrite that code.
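
      The usual trick for the "clever way" part is to flatten each key path into a single composite key, using a separator that can never appear in the data, so that sorted composite keys keep whole subtrees adjacent. A sketch only -- the separator byte is an assumption you'd have to validate against your data:

      use strict;
      use warnings;

      # A separator byte that must never appear in real key components.
      my $SEP = "\x00";

      # DATA / k1 / k2 / k3 / k4 / k5 becomes one flat, sortable key ...
      sub flatten_key {
          return join $SEP, @_;
      }

      # ... and can be split back apart when reading.
      sub unflatten_key {
          my ($flat) = @_;
          return split /\Q$SEP\E/, $flat;
      }

      # So instead of $hash{DATA}{$k1}{$k2}{$k3}{$k4}{$k5} = $value you store
      #   $btree{ flatten_key('DATA', $k1, $k2, $k3, $k4, $k5) } = $value;
      # and a range scan over the btree then walks one subtree contiguously.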

      Odds are fairly good that the right way to do it is to rewrite your multi-level hash as a series of tables in a relational database. Then let the database figure out the correct series of sorts and merges to implement whatever your logic is. Assuming the database comes up with a good plan (they don't always, unfortunately), it will take just a few hours to load and a few hours to query.
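
      Purely as an illustration of the shape of that (SQLite via DBI here only as an example; the table and column names are invented, and read_records() stands in for your parser):

      use strict;
      use warnings;
      use DBI;

      my $dbh = DBI->connect( 'dbi:SQLite:dbname=data.db', '', '',
                              { RaiseError => 1, AutoCommit => 0 } );

      # One flat table instead of a five-level nested hash.
      $dbh->do(q{
          CREATE TABLE IF NOT EXISTS data (
              k1 TEXT, k2 TEXT, k3 TEXT, k4 TEXT, k5 TEXT,
              value TEXT
          )
      });

      my $sth = $dbh->prepare(
          'INSERT INTO data (k1, k2, k3, k4, k5, value) VALUES (?, ?, ?, ?, ?, ?)'
      );

      # Bulk load inside one big transaction (or one per input file).
      for my $rec ( read_records() ) {              # stand-in for your parser
          $sth->execute(@$rec);
      }
      $dbh->commit;

      # Then let the database work out the sorts and merges, e.g.
      #   SELECT k1, k2, SUM(value) FROM data GROUP BY k1, k2;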

      But note that for the exact same seek issues that I have been talking about, you'll want to drop all indexes, load the data, and then rebuild the indexes. Trying to load a lot of data while keeping indexes up to date is going to be slow, because that forces the database to do lots of seeks to disk.
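
      Continuing the sketch above, that just means the CREATE INDEX statements run after the bulk load rather than before it (the index names are invented):

      # Load with no indexes defined, then build them once, each in a
      # single sequential pass, instead of updating them row by row.
      $dbh->do('CREATE INDEX idx_data_k1_k2 ON data (k1, k2)');
      $dbh->do('CREATE INDEX idx_data_k5    ON data (k5)');
      $dbh->commit;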