QM has asked for the wisdom of the Perl Monks concerning the following question:

DBM::Deep has the following file parameters:
# max_buckets: the number of entries that can be added before a reindexing, [16..256], default 16.

# data_sector_size: the size in bytes of a given data sector. [32..256], default 64.
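
For concreteness, this is roughly how I create the file, with both knobs exposed (a minimal sketch; the file name and values are placeholders):

use strict;
use warnings;
use DBM::Deep;

# Placeholder values -- these are the two knobs I want to tune.
my $db = DBM::Deep->new(
    file             => 'data.db',
    max_buckets      => 16,    # [16..256], default 16
    data_sector_size => 64,    # [32..256], default 64
);

$db->{DATA}{k1}{k2} = 'some value';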

Assume I have a representative sample of the data that would be stored. I want to choose parameter values that optimize speed and disk space. I would prioritize speed, but I would give up a few percentage points of it for a big improvement in disk space.

I have a script that reads several GB of input data and creates/updates the hash/db file, then outputs some of the hash data in table format. Processing can take several days, and the db file can also be several GB. Space isn't too restrictive, but obviously I want to take up less rather than more.

Is there some intelligent approach to optimizing these values, such as by inspection of my sample data? For example, what if I knew the mean, median, or mode string lengths for keys and values? Or the mean/median/mode hash depth? Note that the hash structure and contents depend on command-line options (given the same input), so in some cases I would want the D::D parameters to change also. [I know that once the D::D file is written, these parameters can't be changed.]

Or should I just run benchmarks on all (power of two?) combinations of these parameters?
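
If benchmarking is the way to go, I picture something along these lines (a rough sketch, assuming a hypothetical load_sample() routine that replays my representative sample into the db):

use strict;
use warnings;
use Time::HiRes qw(time);
use DBM::Deep;

# Sweep the power-of-two combinations, recording wall time and file size.
for my $buckets (16, 32, 64, 128, 256) {
    for my $sector (32, 64, 128, 256) {
        my $file = "bench_${buckets}_${sector}.db";
        unlink $file;
        my $db = DBM::Deep->new(
            file             => $file,
            max_buckets      => $buckets,
            data_sector_size => $sector,
        );
        my $t0 = time;
        load_sample($db);    # hypothetical: replays the sample data
        undef $db;           # make sure everything is flushed to disk
        printf "buckets=%3d sector=%3d  %8.1fs  %12d bytes\n",
            $buckets, $sector, time - $t0, -s $file;
    }
}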

-QM
--
Quantum Mechanics: The dreams stuff is made of

Re: Optimizing DBM::Deep file parameters
by tilly (Archbishop) on Aug 29, 2008 at 16:33 UTC
    I'm sorry, but you aren't going to make a big performance difference without rewriting your code. And that isn't the fault of DBM::Deep. If you want to make things fast, you should rewrite your code so that you sort your data, and then process your data in sorted order. That should be at least an order of magnitude faster than what you're doing now.

    The physical reason is the same as the one that I described in Re: BerkeleyDB and a very large file. Seeks to disk are slow. When you're writing random records in DBM::Deep, you're forced to do a lot of seeks. But sorting can be done streaming data to and from disk. And disks are much better at that than at random access.
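
    For instance, instead of inserting into the db as you read, the load could look roughly like this (a sketch, assuming tab-separated key/value lines, a hypothetical parse() for your record format, and the external Unix sort):

    use strict;
    use warnings;

    # Phase 1: extract key/value pairs to a flat file (pure streaming writes).
    open my $raw, '>', 'pairs.txt' or die $!;
    while (my $line = <STDIN>) {
        my ($key, $value) = parse($line);    # hypothetical record parser
        print {$raw} "$key\t$value\n";
    }
    close $raw;

    # Phase 2: sort externally, then process in sorted order so all the work
    # for a given key happens in one contiguous pass instead of random seeks.
    system('sort', '-o', 'pairs.sorted', 'pairs.txt') == 0 or die "sort failed";

    open my $sorted, '<', 'pairs.sorted' or die $!;
    while (my $pair = <$sorted>) {
        chomp $pair;
        my ($key, $value) = split /\t/, $pair, 2;
        # ... aggregate per key here; adjacent lines share keys ...
    }
    close $sorted;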

      Good points.

      To some degree, I think I can sort the data before writing to the hash/file. However, there are several points that will hinder that:

      1) The hash is multilevel. 99% of the data inserts look like the first example below:

      $hash{DATA}{k1}{k2}{k3}{k4}{k5}     = $value1;
      $hash{DATA}{k1}{k2}{k3}{k4}{h1}{h2} = $value2;
      $hash{HEADER1}{h1}{h2}              = $value3;
      $hash{HEADER2}{j1}                  = $value4;
      Would this be a hurdle?

      2) In some cases it's possible to know a lot about the contents of an input file based on its name and path, but not always. Input data files could be sorted according to their apparent contents.

      Would it then make sense to use a %temp_hash for each input file, and then import to the %tied_hash at the end of each file?
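
      Something like this is what I have in mind (a sketch only, assuming @input_files holds the file list; I haven't verified how well import() behaves at this scale):

      my $db = DBM::Deep->new( file => 'data.db' );

      for my $file (@input_files) {
          my %temp_hash;    # plain in-memory hash, one per input file
          open my $fh, '<', $file or die "$file: $!";
          while (my $line = <$fh>) {
              # ... fill %temp_hash exactly as I would fill the tied hash ...
          }
          close $fh;
          $db->import(\%temp_hash);    # one bulk merge per input file
      }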

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

        Sorting the data before writing to the hash file will not help unless you can make the file internally be a btree rather than a hash. The reason for that is that hashes work by mapping pieces of data to apparently random buckets. When you go to fetch/insert data, that bucket should have very few entries. Which is what makes the average access time of a hash O(1) in the size of the hash. Unfortunately when a hash is laid out on disk, that means that each value goes to a random location on disk.

        What that means is that if you sort your data then put it in a hash, the sort is useless. A btree, by contrast, is essentially a tree with the data kept in sorted order. (There are some details to make sure that the tree continues to always be balanced, but it basically looks like a tree.) Therefore if you start with a sorted data set then try to put it into a btree, you will have great locality of reference, which means that a given section of data will be accessed, written, reaccessed, and rewritten many times in RAM before the operating system has to flush to disk. Which means that building your data structure will be fast.

        Unfortunately DBM::Deep is internally a hash, not a btree. And there is no option to make it a btree. So you can't use this trick to speed up your load. Plus when you access your data, you still have the same problem that you're accessing data that is all over on the disk.

        So unless you want to rewrite DBM::Deep to be a btree internally, or to find a clever way to store your multi-level hash in a btree like BerkeleyDB, you can't find any such simple speedup. Instead you'd need to rethink your entire multi-level hash data structure and rewrite that code.
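
        If you do go the BerkeleyDB route, one common way to fake a multi-level hash in a btree is a composite key (sketch only; the separator must be something that can never appear in your keys):

        use BerkeleyDB;

        tie my %db, 'BerkeleyDB::Btree',
            -Filename => 'data.bdb',
            -Flags    => DB_CREATE
            or die "Cannot open data.bdb: $BerkeleyDB::Error";

        # Flatten $hash{DATA}{k1}{k2}{k3}{k4}{k5} = $value1 into one sorted key.
        my $SEP = "\x1f";    # must never occur in any key component
        $db{ join $SEP, 'DATA', $k1, $k2, $k3, $k4, $k5 } = $value1;

        # Because the btree keeps keys sorted, inserting pre-sorted keys gives
        # the locality of reference described above, and a cursor range scan
        # walks one subtree's entries contiguously.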

        Odds are fairly good that the right way to do it is to rewrite your multi-level hash as a series of tables in a relational database. Then let the database figure out the correct series of sorts and merges to implement whatever your logic is. Assuming the database comes up with a good plan (they don't always, unfortunately), it will take just a few hours to load and a few hours to query.

        But note that for the exact same seek issues that I have been talking about, you'll want to drop all indexes, load the data, then rebuild the indexes. Loading a lot of data while trying to keep indexes up to date is slow because it forces the database to do lots of seeks to disk.
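
        In DBI terms the load pattern looks roughly like this (a sketch with a made-up one-table schema and a hypothetical next_row() record reader; the real schema would mirror your key levels):

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:SQLite:dbname=data.sqlite', '', '',
                               { RaiseError => 1, AutoCommit => 0 });

        $dbh->do('CREATE TABLE facts (k1 TEXT, k2 TEXT, k3 TEXT, k4 TEXT, k5 TEXT, value TEXT)');

        # Load first, with no indexes to maintain.
        my $sth = $dbh->prepare('INSERT INTO facts VALUES (?,?,?,?,?,?)');
        while (my $row = next_row()) {    # hypothetical record reader
            $sth->execute(@$row);
        }
        $dbh->commit;

        # Build the index once, after the bulk load.
        $dbh->do('CREATE INDEX facts_key ON facts (k1, k2, k3, k4, k5)');
        $dbh->commit;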

Re: Optimizing DBM::Deep file parameters
by jettero (Monsignor) on Aug 29, 2008 at 13:49 UTC

    You might also try the Google Groups area specific to this module. That's a pretty responsive group. (Several of them are here also, though.)

    -Paul

Re: Optimizing DBM::Deep file parameters
by dragonchild (Archbishop) on Aug 31, 2008 at 01:32 UTC
    max_buckets is useful to increase if you know that you will have a large number of elements in a given hash. data_sector_size is useful to increase if you know your data averages more than 50-60 bytes. This is how much dbm-deep allocates for every key, data value, and classname. If the value is more than this amount (minus a constant dependent on other parameters), a second data sector is created and they are chained.
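
    So one way to pick data_sector_size by inspection of the sample is to survey the lengths of the keys and values (a rough sketch, assuming one value per line on STDIN):

    use strict;
    use warnings;

    # Crude length survey of sample values read from STDIN, one per line.
    my @lengths;
    while (my $line = <STDIN>) {
        chomp $line;
        push @lengths, length $line;
    }
    die "no input\n" unless @lengths;
    @lengths = sort { $a <=> $b } @lengths;
    my $sum = 0;
    $sum += $_ for @lengths;
    printf "n=%d  mean=%.1f  median=%d  max=%d\n",
        scalar @lengths, $sum / @lengths,
        $lengths[ int($#lengths / 2) ], $lengths[-1];
    # If the mean/median sit well above ~50-60 bytes, a larger
    # data_sector_size should cut down on sector chaining.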

    To answer the next question, I have no numbers on how these values will affect performance. Frankly, I have done extremely little performance testing on dbm-deep and know of very little that's been done. I would be delighted if someone were to put together a few tests or scripts that I can have in the repository to test the effects of various performance improvements I'm working on.


    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?