henrywalker23 has asked for the wisdom of the Perl Monks concerning the following question:

I have a Perl application which has been running to my satisfaction for some five years. It uses MLDBM with DB_File and Storable to maintain a hash of arrays, which represents an index of words: keys are single words; values are arrays of record numbers. From time to time, new words are added to the index. Recently, the addition of a new word caused overall performance to deteriorate suddenly. Output is unchanged, but speed is reduced by a factor of 3. This change only occurs with certain, apparently unrelated, words. If the offending words are deleted, other words can be added as before with no effect on performance. The Windows system monitor shows an enormous increase in paging after one of the problem words is added, suggesting to me that the whole hash fitted in memory before the addition, but not after. However, the size of the MLDBM file on disk (336 KB) is almost unchanged. Moreover, installing extra memory to give a 50% increase (512 => 768 MB) has no effect on the speed whatsoever. Suggestions?

P.S. The index is updated as follows:
my $arref = (exists $index->{$word}) ? $index->{$word} : []; # if word in index, extract reference
push(@{$arref}, $recno);  # add the new record number to the referenced array
$index->{$word} = $arref; # put the reference back in the database

Replies are listed 'Best First'.
Re: MLDBM performance problem
by BrowserUk (Patriarch) on Jun 09, 2007 at 19:14 UTC

    This is speculation, but it seems to fit your description. Maybe someone from p5p can confirm or deny it?

    When a hash fills to a certain level, Perl reallocates memory to increase the number of buckets; each time it does, it doubles the count.

    It won't happen on every new addition, because a hash consists of buckets and chains: if a new addition hashes to the same value as an existing word, it will be added to the end of the chain for the corresponding bucket and no new buckets will be needed. If, however, a new word hashes to a previously unused hash value, the percentage-filled criterion may be met and the hash will be reallocated. I.e. a new hash is allocated with double the number of buckets, and the contents of the existing hash are re-hashed(?) and moved into the new one. This process will (temporarily) require 3x the number of buckets (though not the chains, I think).

    If you are already close to your memory limits, that reallocation could perhaps be the cause of the slowdown. Even the 50% increase may not be enough to allow it.

    Note: What I'm describing (badly), is the operations of a normal hash. I'm not sure if MLDBM operates in anything like the same way.

    One way to test this is to print scalar %hash just before and just after adding one of the problem words. How to fix it is another problem, but if you can confirm that my speculation has some foundation, it might lead to a solution.
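    A sketch of that check (my addition, not from the thread). Note that on perls before 5.26, scalar %hash in string context yields the used/allocated bucket string directly; from 5.26 onward it returns the key count instead, and Hash::Util::bucket_ratio gives the old string. The word list here is illustrative.

```perl
use strict;
use warnings;

# Return the "used/allocated" bucket string for the given hash ref,
# working on both old and new perls.
sub bucket_info {
    my $href = shift;
    require Hash::Util;
    return defined &Hash::Util::bucket_ratio
        ? &Hash::Util::bucket_ratio($href)   # perl >= 5.26
        : scalar(%$href);                    # older perls: "used/total"
}

my %index = map { $_ => [1] } qw(cat dog fish);
print "before: ", bucket_info(\%index), "\n";
$index{newword} = [42];                      # stand-in for a problem word
print "after:  ", bucket_info(\%index), "\n";
```

    A jump in the denominator between the two prints would show a bucket doubling triggered by that one insertion.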

    Like I say, your description made me think about the buckets thing, so I thought I would mention it.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Thanks, I had some idea of how the bucket and chain system works, but I couldn't imagine the number would be doubled or tripled at any stage. This may explain the problem, at least. I'll try your scalar hash test, anyway.
        ... but I couldn't imagine the number would be doubled or tripled at any stage

        It's fairly easily demonstrated. The string returned from scalar %hash is the no-of-buckets-used/no-of-buckets-allocated.

        $buckets = %h;
        for ('aaaa'..'zzzz') {
            $h{$_} = undef;
            ## Remove the conditional on the next line to see that no bucket growth
            ## occurs when an addition hashes to an existing value.
            print $buckets = %h if %h ne $buckets;
        };;
        1/8 2/8 3/8 4/8 5/8 ## 62%
        6/16 7/16 8/16 9/16 10/16 ## 62%
        15/32 16/32 17/32 18/32 19/32 20/32 21/32 ## 65%
        25/64 26/64 27/64 28/64 29/64 30/64 31/64 32/64 33/64 34/64 35/64 36/64 37/64 38/64 39/64 40/64 41/64 ## 64%
        51/128 52/128 53/128 54/128 55/128 56/128 57/128 58/128 59/128 60/128 61/128 62/128 63/128 64/128 65/128 66/128 67/128 68/128 69/128 70/128 71/128 72/128 73/128 74/128 75/128 76/128 77/128 78/128 79/128 80/128 81/128 82/128 83/128 84/128 85/128 ## 66%
        106/256

        Note that scalar %hash on a tied hash may do nothing or something completely different.

Re: MLDBM performance problem
by perrin (Chancellor) on Jun 10, 2007 at 00:05 UTC

    Strange. This may have something to do with the BTree implementation in the version of DB_File you have. You might try upgrading your libdb to a later version (this is something you'd do through your OS tools) and upgrading to the latest BerkeleyDB release from CPAN (which includes the latest DB_File).

    There is also a way to change the cache size that it uses, although you may need to use the API from the BerkeleyDB module rather than the DB_File one to do it.
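    For the DB_File side, a hedged sketch of the cache-size knob (the filename and the 8 MB figure are made up for illustration): the $DB_BTREE info object accepts a cachesize field, and MLDBM hands that same object through to DB_File's tie, so the setting applies there too.

```perl
use strict;
use warnings;
use Fcntl qw(O_CREAT O_RDWR);
use DB_File;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/index.db";        # illustrative path

# Enlarge DB_File's Btree cache (in bytes) before tying.
$DB_BTREE->{'cachesize'} = 8 * 1024 * 1024;

tie my %index, 'DB_File', $file, O_CREAT|O_RDWR, 0666, $DB_BTREE
    or die "Cannot tie $file: $!";

$index{frobnicate} = '42';         # plain DB_File stores flat scalars only
my $got = $index{frobnicate};
untie %index;
```

    With MLDBM in the picture, the tie line would name 'MLDBM' instead of 'DB_File' and still take $DB_BTREE as its last argument.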

    If your keys are small enough, SDBM_File can be faster than DB_File. It has size limitations though.

      Thanks, I'll try some of your suggestions.
      In the meantime I have upgraded DB_File to the latest version - and this solved my problem - at least for the present. I will bear in mind your other suggestions in case it reappears. Thanks again.
Re: MLDBM performance problem
by ldln (Pilgrim) on Jun 09, 2007 at 18:28 UTC
    Have you tried rebuilding the MLDBM hash, that is, copying each entry to a new MLDBM hash? It may be corrupted.
    I've had good success with this approach before.
      I have tried this, with no better success. In any case, if the hash was corrupted, I would expect problems with the output. In fact, the system works just as well as before - just much slower. Thanks for the suggestion, anyhow.
Re: MLDBM performance problem
by andreas1234567 (Vicar) on Jun 09, 2007 at 18:56 UTC
    Maybe you should consider switching to a relational database backend? MySQL, PostgreSQL, Oracle, IBM DB2, .. you have many to choose from.
    --
    print map{chr}unpack(q{A3}x24,q{074117115116032097110111116104101114032080101114108032104097099107101114})
      Thanks for the suggestion, but all I need is a persistent hash where the values are array references. I'm worried that a database such as you suggest would involve a lot of overhead and be counter-productive as far as performance is concerned.
        I'm worried that a database such as you suggest would involve a lot of overhead and be counter-productive as far as performance is concerned.
        Perhaps you're right, but unless you Benchmark the two approaches, that's pure speculation. And, if you have issues with paging due to data not fitting into memory, using a proper DBMS is probably a good way out.
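        A skeleton for such a benchmark, using the core Benchmark module (my sketch, not from the thread: the two subs compare a plain in-memory hash against a tied on-disk SDBM_File hash as stand-ins; the real comparison would drop the MLDBM and DBI lookups into their places):

```perl
use strict;
use warnings;
use Fcntl qw(O_CREAT O_RDWR);
use SDBM_File;
use File::Temp qw(tempdir);
use Benchmark qw(timethese cmpthese);

my $dir = tempdir(CLEANUP => 1);

# Build identical data in memory and on disk.
my %memory = map { $_ => $_ } 'aa' .. 'zz';
tie my %disk, 'SDBM_File', "$dir/bench", O_CREAT|O_RDWR, 0666
    or die "Cannot tie: $!";
%disk = %memory;

# Time the same lookup against both backends; 'none' suppresses the
# per-test chatter so cmpthese can print the comparison table.
my $results = timethese(50_000, {
    memory => sub { my $v = $memory{'mm'} },
    tied   => sub { my $v = $disk{'mm'} },
}, 'none');
cmpthese($results);
```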
Re: MLDBM performance problem
by eserte (Deacon) on Jun 11, 2007 at 10:22 UTC
    Note that with DB_File you can choose between two types: DB_HASH and DB_BTREE. Maybe one of the two behaves better or worse depending on the keys in use.
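    A minimal sketch of trying both (my addition; filenames are illustrative): the only change is the info object passed to the tie. One visible difference is that DB_BTREE iterates keys in sorted order while DB_HASH guarantees no order.

```perl
use strict;
use warnings;
use Fcntl qw(O_CREAT O_RDWR);
use DB_File;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);

tie my %as_hash,  'DB_File', "$dir/h.db", O_CREAT|O_RDWR, 0666, $DB_HASH
    or die "tie DB_HASH: $!";
tie my %as_btree, 'DB_File', "$dir/b.db", O_CREAT|O_RDWR, 0666, $DB_BTREE
    or die "tie DB_BTREE: $!";

# Same data into both formats.
for my $k ('aa' .. 'az') {
    $as_hash{$k}  = 1;
    $as_btree{$k} = 1;
}

my @btree_keys = keys %as_btree;   # comes back in sorted order
untie %as_hash;
untie %as_btree;
```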
Re: MLDBM performance problem
by DrHyde (Prior) on Jun 12, 2007 at 09:07 UTC
    I'm not going to address your performance problem, but it's worth noting that DBM::Deep does a much better job of storing perl data structures than MLDBM does.
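    For illustration, a hedged sketch of the DBM::Deep approach (DBM::Deep is a CPAN module; the filename and data are made up): nested structures persist transparently, so the fetch/modify/store dance from the original post is unnecessary.

```perl
use strict;
use warnings;
use DBM::Deep;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);
my $db  = DBM::Deep->new("$dir/index.db");

$db->{some_word} = [1, 2];         # store an array ref directly
push @{ $db->{some_word} }, 3;     # modifies the stored array in place

my @recnos = @{ $db->{some_word} };
```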