isync has asked for the wisdom of the Perl Monks concerning the following question:

Hi!

I am using a large SDBM database to store MD5 hashes (as keys) and about 32 bytes of data per entry (as values):
use SDBM_File; use Fcntl qw(O_RDWR O_CREAT);
$ref = tie(%hashs, 'SDBM_File', "$path/hashs", O_RDWR|O_CREAT, 0644) or die "Couldn't tie SDBM file: $!; aborting";

I know about sdbm's 1024/1008 byte limit (which I am not exceeding) and sdbm is doing good work for me. As far as I can tell it does not drop data, as mentioned here, even on my 3.5GB db with millions of records... And it's fast! (beating DB_File and everything else..)

But how can I make it even faster? Is there a way to tell sdbm to reduce the page size to, in my case, 160 bytes? Would this give a speed gain? Or should I just reduce my amount of data per entry to, let's say, 92 bytes by using a different hash and less payload? Would sdbm adapt? (How do I tell sdbm to use a different PAIRMAX, like the README says?)

I am adding 20-60 new hashes per write cycle. Does it help sdbm's internal seeks to feed it the hashes in sorted order? (And if so, what exactly is sdbm's internal sort order? Any examples?)

So many questions... Any answers?

Re: Further optimize usage of SDBM_File
by perrin (Chancellor) on Jun 14, 2007 at 17:34 UTC
    Try calling STORE/FETCH directly instead of using the tied hash. That should be faster. Tied hashes are for suckers.
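    A minimal sketch of the difference, assuming the $ref returned by tie() above is still in scope ($key and $data are placeholders):

        # Tied-hash access dispatches through Perl's tie magic on every lookup:
        my $via_tie = $hashs{$key};
        # Calling the underlying SDBM_File object directly skips that dispatch:
        my $via_obj = $ref->FETCH($key);
        $ref->STORE($key, $data);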
      Really?? (Never liked them either.)
      I will try it.
Re: Further optimize usage of SDBM_File
by samtregar (Abbot) on Jun 14, 2007 at 18:12 UTC
    I think you've got plenty of good ideas for possible optimizations. But first, have you profiled your app? If not, break out something like Devel::DProf or Devel::Profile. These can tell you if your DB access is really your bottleneck. If it is, try making these changes and validate them by re-running the profiler, or by using Benchmark (but be sure to re-profile before declaring success!).
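    For example, a rough sketch of both steps (the sub bodies and $key are placeholders, not your real code):

        # Profile first:  perl -d:DProf yourscript.pl   then   dprofpp tmon.out
        use Benchmark qw(cmpthese);
        cmpthese(-3, {
            tied_hash => sub { my $v = $hashs{$key} },
            direct    => sub { my $v = $ref->FETCH($key) },
        });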

    And just because I can't help myself, you might consider the fact that you're probably double-hashing your data. You hash something to produce an MD5 key and then SDBM re-hashes that key into an internal key. That's probably a waste of time - perhaps you could feed SDBM a natural key instead, if you have one.
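    In other words, something like this hypothetical sketch ($record stands for the original data you are currently hashing, $payload for your ~32 bytes of value):

        use Digest::MD5 qw(md5);
        # Current approach: hash once yourself, then SDBM hashes the MD5 again internally
        $ref->STORE(md5($record), $payload);
        # Suggested: feed SDBM the natural key and let it do the only hashing
        $ref->STORE($record, $payload);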

    -sam

      Yes, I did profile it with Devel::DProf. Sdbm is not the biggest concern, but maybe the only one left to optimize, as it is the last part that really accesses the disk...

      Does sdbm really "re-hash (my) key into an internal key"?? The data I am hashing is about 300 bytes, and I did the hashing to reduce data while getting a (quite) unique key... My understanding was that sdbm would use the supplied data 1:1 as the key, but if it really hashes it, I would revert to feeding it the original... Are you sure?

      What about "feeding the data sorted"? Is there any advantage in doing so?

      And what about reducing the page size? Ever tried it?? (And can I anticipate what sdbm hashes a key to? Would that make sorting senseless??)
        Does sdbm really "re-hash (my) key into an internal key"?? The data I am hashing is about 300 bytes, and I did the hashing to reduce data while getting a (quite) unique key... My understanding was that sdbm would use the supplied data 1:1 as the key, but if it really hashes it, I would revert to feeding it the original... Are you sure?

        How could it implement a hash table without hashing the keys? I'm no SDBM expert, but this leads me to believe my guess is correct:

        http://www.partow.net/programming/hashfunctions/#SDBMHashFunction
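
        A quick Perl transcription of that function (my own sketch from the formula at that link, hash = hash * 65599 + byte, just to show that whatever key you supply gets hashed again):

            sub sdbm_hash {
                my ($key) = @_;
                my $hash = 0;
                for my $byte (unpack 'C*', $key) {
                    # equivalent to $hash * 65599 + $byte, kept within 32 bits
                    $hash = ($byte + ($hash << 6) + ($hash << 16) - $hash) & 0xFFFFFFFF;
                }
                return $hash;
            }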

        I can't answer your more specific performance questions. I doubt anyone can, with the possible exception of the people who wrote SDBM. I suggest you set up some benchmarks and try it!

        -sam