moggs has asked for the wisdom of the Perl Monks concerning the following question:

Hi folks

I've got a quick query, which I'm sure someone can help me with.

I'm using MLDBM to store fairly complex data in a disk hash: up to 2,000 elements of stock data, each comprising about 20 pieces of data. At the moment it's nice and speedy on a tidy Linux box.
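
For context, here's a minimal sketch of the kind of setup I mean (the file name and record fields below are just made-up examples, and I'm assuming DB_File and Storable as the MLDBM back ends):

    use strict;
    use warnings;
    use MLDBM qw(DB_File Storable);   # serialize nested structures via Storable
    use Fcntl;

    # Tie a disk hash; each value is a reference to a record of ~20 fields.
    tie my %stock, 'MLDBM', 'stock.dbm', O_CREAT|O_RDWR, 0640
        or die "Cannot open stock.dbm: $!";

    # One element of stock data (illustrative fields only).
    $stock{'SKU-0001'} = {
        description => 'Widget, blue',
        quantity    => 42,
        price       => 9.99,
        # ... up to ~20 fields per record
    };

    # Note: MLDBM hands back a copy, so to change a nested field you
    # must fetch the record, modify it, and store the whole thing again.
    my $rec = $stock{'SKU-0001'};
    print "$rec->{description}: $rec->{quantity} in stock\n";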

My question is: will the speed diminish as the volume of data increases to, say, 10,000 or possibly 50,000 elements? Will it get to a point where reading the data is unbearably slow?

Thanks for this

Moggs

Re: MLDBM Efficiency
by davido (Cardinal) on Sep 15, 2004 at 23:17 UTC

    That's an interesting question (really). Before reading Mastering Algorithms with Perl (O'Reilly & Associates), I would have assumed that as you add elements to a conventional Perl hash there must be some increase in the amount of time it takes to look up one of the elements. Even after reading MAwP, which gives an introduction to how hashes are implemented, I might have thought that the order of growth as you add elements would be O(log N), which is a fancy way of saying that as you add elements there is a relatively small growth in the length of time it takes to perform lookups.

    "...might have thought...", I say, because MAwP puts that theory to rest and sets the record straight when it specifically states that a well written hashing algorithm will yield O(1) order of growth, which is a fancy way of saying that there is no growth in the computational work of looking up hash elements as the hash grows large.

    Here is a test verifying that theory with respect to Perl's internal hashes.

    use Benchmark qw/:all/;

    @lrg_hash{ 0 .. 20000 } = ();
    @sml_hash{ 0 .. 10 }    = ();
    $sml_hash{ 5000 }       = undef;  # Create a hash key of the same name
                                      # in %sml_hash as in %lrg_hash.

    cmpthese( 500000, {
        Big    => '$lrg_hash{5000}=1;',
        Little => '$sml_hash{5000}=1;',
    } );

    __OUTPUT__
              Rate    Big Little
    Big    378788/s     --    -1%
    Little 381679/s     1%     --

    Clearly there is an insignificant time difference.

    I'm not familiar with MLDBM's internal algorithm, but if it is a well written hash, as Perl's hashes are, you can expect O(1) order of growth. Go ahead and let the DB grow to 50,000 elements, or even 500,000; it won't matter. To confirm this, I ran the following benchmark:

    use MLDBM;
    use Benchmark qw/:all/;
    use Fcntl;

    $dbL = tie %lrg_hash, 'MLDBM', 'lrgtest.dbm', O_CREAT|O_RDWR, 0640
        or die $!;
    $dbS = tie %sml_hash, 'MLDBM', 'smltest.dbm', O_CREAT|O_RDWR, 0640
        or die $!;

    @lrg_hash{ 0 .. 20000 } = ();
    @sml_hash{ 0 .. 10 }    = ();
    $sml_hash{ 5000 }       = undef;  # Create a hash key of the same name
                                      # in %sml_hash as in %lrg_hash.

    cmpthese( 5000, {
        Big    => '$lrg_hash{5000}=1;',
        Little => '$sml_hash{5000}=1;',
    } );

    __OUTPUT__
             Rate   Big Little
    Big    3247/s    --    -4%
    Little 3378/s    4%     --

    Although the DB access slows things down considerably, there is no appreciable difference between the speed of looking up a hash element in the 20,000-element hash and in the 11-element hash.


    Dave

Re: MLDBM Efficiency
by graff (Chancellor) on Sep 16, 2004 at 03:00 UTC
    I can't speak from personal experience on MLDBM, but based on related experience I've had with DBM-based stuff, I would say "it depends". It's been a couple of years since I've done anything myself with DBM files, so things may have changed in the course of various upgrades, but I do recall that one or another DBM package (probably the GNU one) had a problem when it came to building really large DBM files.

    But once the files were built, retrieval was never a problem -- retrieval time does not increase noticeably with the amount of data being stored (in terms of number of keys or overall content). That's the whole point of DBM files.

    (I think the problem with building the DBM file has to do with how often, and in what manner, the tool has to rebuild its hashing table -- the indexes used for seeking to the data for a given key -- as more keys are added. The DBM flavors that keep the hash index and data in separate files are likely to be faster than those that create only a single file with indexes and data combined.)

    So if growing the database will be an ongoing activity in your app (such that users would be bothered if there were a 2-minute lag when they hit a button to add data), you'll want to test MLDBM in terms of how long it takes to add increments of, say, 5K records at a time, and look out for either occasional or consistent delays in each increment.

    An adequate test would be to generate an appropriate number of "random" strings with suitable length, maybe using MD5 checksums as the keys or something like that.
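
    For instance, something along these lines (an untested sketch; the file name, increment size, and dummy record shape are arbitrary choices of mine, and I'm assuming DB_File and Storable as the MLDBM back ends):

        use strict;
        use warnings;
        use MLDBM qw(DB_File Storable);
        use Fcntl;
        use Digest::MD5 qw(md5_hex);
        use Time::HiRes qw(time);

        tie my %db, 'MLDBM', 'growtest.dbm', O_CREAT|O_RDWR, 0640
            or die "Cannot open growtest.dbm: $!";

        # Add records in 5K increments and time each batch; a sudden jump
        # in batch time would point to an expensive hash-table rebuild in
        # the underlying DBM layer.
        my $n = 0;
        for my $batch ( 1 .. 10 ) {
            my $t0 = time;
            for ( 1 .. 5000 ) {
                my $key = md5_hex( 'record' . $n++ );  # "random"-looking key
                $db{$key} = { payload => 'x' x 200 };  # dummy record
            }
            printf "after %6d keys: %.2f sec for this batch\n", $n, time - $t0;
        }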

Re: MLDBM Efficiency
by sintadil (Pilgrim) on Sep 15, 2004 at 22:24 UTC

    My question is: will the speed diminish as the volume of data increases to, say, 10,000 or possibly 50,000 elements? Will it get to a point where reading the data is unbearably slow?

    There's a saying among programmers, especially in the Perl community:

    When in doubt, benchmark!

    I suggest that you read the docs for Benchmark.
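
    A minimal example to get started with (the operation being timed is just a placeholder; substitute your own MLDBM reads):

        use Benchmark qw(timethis);

        # Time 100,000 iterations of the operation in question.
        timethis( 100_000, sub { my %h; @h{ 1 .. 100 } = () } );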

Re: MLDBM Efficiency
by moggs (Sexton) on Sep 29, 2004 at 10:39 UTC
    Many, many thanks for these comments.

    I have been benchmarking but couldn't draw a definite conclusion from it.

    The code and ideas for testing were great, and I've had numerous hours of fun running tests.

    Cheers

    Moggs