mrguy123 has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,
I have been working on a web-based bioinformatics program written in Perl by a non-programmer.
My goal is to make it run faster. One thing I noticed is that the program creates huge hashes (millions of keys) that are nearly the same every time (depending on the input).
What I did was create these hashes beforehand and store them using Storable.
Then, each time the program runs, I retrieve them, saving a lot of time (it now runs in about half the time).
The problem is that retrieving the hashes still takes a very expensive 12 seconds (out of the 18 seconds the program takes to run).
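For reference, the precompute-and-retrieve approach described above looks roughly like this (file name and data are made up; the real hashes have millions of keys):

```perl
use strict;
use warnings;
use Storable qw(store retrieve);

# One-time step: build the big hash and freeze it to disk.
my %big = map { "key$_" => $_ * 2 } 1 .. 1000;   # stand-in data
store \%big, 'big_hash.sto';

# Every run: one retrieve call instead of rebuilding millions of keys.
my $href = retrieve 'big_hash.sto';
print $href->{key42}, "\n";   # prints 84
```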

My questions are:

1) Are there other serialization modules that are faster than Storable?

2) Any other advice on how to make very large hashes work faster?

I know the best solution would be to rewrite the program, but I'm not sure that's an option right now. I also tried storing the data in a DB, but then the program requires many DB calls, which take too much time.
Thanks in advance,
mrguy123

I have yet to see any problem, however complicated, which, when you looked at it in the right way, did not become still more complicated. - Poul Anderson

Replies are listed 'Best First'.
Re: Fast(er) serialization in Perl
by The Perlman (Scribe) on Apr 11, 2010 at 11:47 UTC
    What do the keys and values look like?

    With "millions of keys" chances are high that they are highly uniform.

    That means you might be able to transform it into an array or array-like structure.

    You might even be able to store/load this array with the help of pack/unpack, which would speed things up significantly.
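A minimal sketch of the pack/unpack idea, assuming the keys are dense integer IDs and the values fit in unsigned 32-bit numbers (both assumptions need to hold for your data):

```perl
use strict;
use warnings;

# Stand-in data: the value for ID $i is $i * 10.
my @values = map { $_ * 10 } 0 .. 9;

# "Serialize" the whole array into one fixed-width byte string.
my $packed = pack 'N*', @values;        # 4 bytes per value

# Constant-time lookup by ID, with no per-key hash overhead:
my $id  = 7;
my $val = unpack 'N', substr $packed, $id * 4, 4;
print "$val\n";                         # prints 70
```

The byte string can be written to disk with a plain print and slurped back in a single read, which is typically much cheaper than deserializing millions of individual hash entries.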

    If it's "web based", does that mean it's a CGI? You might want to consider FastCGI or mod_perl to keep the data in memory between different queries.

      Thanks
      Because most of the program's logic is based on the hashes (and I'm not sure I want to change that just yet), I want to keep the hashes functional, so I'm not sure I can use the array option.
      Regarding the web time: because of University guidelines, the web-based part is actually in PHP :(.
      However, it is not the bottleneck that needs fixing most urgently.
      The question is: can I keep my giant hashes and still save time (mostly on loading them)?
        >I want to keep the hashes functional

        As I told you further down, you can use Tie::Hash to keep the interface functional.

        IMHO there are no general solutions faster than Storable! *

        You need to provide more info.

        If your PHP is calling your Perl script, you should check whether you can hold the data structure in memory.

        You may also check that you're not running into RAM problems causing massive swapping.

        I once sped up a program just by transforming a huge hash into a hash of hashes (by halving the keys). Since the system then only retrieved the sub-hashes actually needed from disk swap, I got a fantastic speed gain.

        Whether this is transferable to your case is unknown, since you don't provide enough info...

        Footnote: (*)

        from Storable

        "SPEED
        The heart of Storable is written in C for decent speed. Extra low-level optimizations have been made when manipulating perl internals, to sacrifice encapsulation for the benefit of greater speed."

        IMHO it's evident that you need to invest some brainpower to achieve further speed gains!
        the web based part is actually in PHP

        You should benchmark whether Storable is really your bottleneck; starting a non-persistent Perl process takes some time...
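A quick way to check that is to time the retrieve call by itself with the core Time::HiRes module and compare it against the total runtime. This sketch builds a stand-in file so it runs anywhere; point it at your real Storable file instead:

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);
use Time::HiRes qw(gettimeofday tv_interval);

# Stand-in file; replace with your real Storable file.
nstore { map { $_ => 1 } 1 .. 10_000 }, 'bench.sto';

my $t0 = [gettimeofday];
my $h  = retrieve 'bench.sto';
printf "retrieve took %.3fs for %d keys\n",
    tv_interval($t0), scalar keys %$h;
```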

Re: Fast(er) serialization in Perl
by BrowserUk (Patriarch) on Apr 11, 2010 at 16:00 UTC

    Like others, I think there is almost certainly a better (more space-efficient and faster to load) method of storing your data that would be equally if not more efficient for performing your lookups and would require minimal changes to your script.

    The key to transforming your hashes into that form is more detail about the nature of the data. If you run this script against your hash structure (fill in the name of your Storable file and redirect the output to another file) and post the output, you might get suggestions for how to perform that transformation:

    #! perl -slw
    use strict;

    use List::Util qw[ max minstr maxstr ];
    use Storable qw[ retrieve ];

    my $h = retrieve '/path/to/yourfile';

    for my $l1 ( keys %{ $h } ) {
        for my $l2 ( keys %{ $h->{ $l1 } } ) {
            printf "$l1->$l2: N: %d minL3: %s maxL3: %s maxVal: %d\n",
                scalar( keys %{ $h->{ $l1 }{ $l2 } } ),
                minstr( keys %{ $h->{ $l1 }{ $l2 } } ),
                maxstr( keys %{ $h->{ $l1 }{ $l2 } } ),
                max( values %{ $h->{ $l1 }{ $l2 } } );
        }
    }

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Fast(er) serialization in Perl
by wfsp (Abbot) on Apr 11, 2010 at 11:40 UTC
    Perhaps consider a disk based hash like DBM::Deep.
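A sketch of what that would look like with DBM::Deep (a CPAN module; file name and key are made up). Nothing is deserialized up front; each lookup goes to the file:

```perl
use strict;
use warnings;
use DBM::Deep;

# First run populates the file; later runs just open it.
my $db = DBM::Deep->new('big_hash.db');
$db->{gene42} = 'some annotation';

# A fresh process opens the file instantly and reads keys on demand.
print $db->{gene42}, "\n";   # prints "some annotation"
```

Startup cost drops to almost nothing; the trade-off is that every lookup becomes a disk read, so this wins only if each run touches a small fraction of the keys.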
Re: Fast(er) serialization in Perl
by The Perlman (Scribe) on Apr 11, 2010 at 12:10 UTC
    Do all keys have the same importance? You may want to use a cache solution holding only the most frequently used keys. If another key is requested, retrieve it from disk and add it to the cache.

    (Again, if you can linearize your data into an array, disk lookup can be very fast, because you can calculate the offset and jump to it with the help of seek.)

    Realizing this with Tie::Hash would keep the interface to your data structure stable and spare you any refactoring.
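A minimal sketch of that tied-hash cache; `%on_disk` and the fetcher sub are hypothetical stand-ins for a real seek()-based disk lookup:

```perl
use strict;
use warnings;

package CachedHash;

sub TIEHASH {
    my ($class, $fetcher) = @_;
    return bless { cache => {}, fetch => $fetcher }, $class;
}

sub FETCH {
    my ($self, $key) = @_;
    # Serve from the in-memory cache; fall back to the slow fetcher.
    $self->{cache}{$key} = $self->{fetch}->($key)
        unless exists $self->{cache}{$key};
    return $self->{cache}{$key};
}

sub STORE  { $_[0]{cache}{ $_[1] } = $_[2] }
sub EXISTS { exists $_[0]{cache}{ $_[1] } }

package main;

# Hypothetical backing store standing in for the on-disk data.
my %on_disk = ( chr1 => 'gene_a', chr2 => 'gene_b' );

tie my %h, 'CachedHash', sub { $on_disk{ $_[0] } };
print $h{chr1}, "\n";   # fetched once, served from the cache afterwards
```

Existing code that reads `$h{...}` keeps working unchanged, which is the whole point of going through tie.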

Re: Fast(er) serialization in Perl
by thezip (Vicar) on Apr 11, 2010 at 14:48 UTC

    You might consider using the profiler Devel::NYTProf to assess where the bottlenecks are in your code.

    It provides a verbose indication of how much time is spent executing code at the subroutine and statement level.
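Assuming Devel::NYTProf is installed from CPAN, a typical run looks like this (the script name is made up):

```shell
perl -d:NYTProf yourscript.pl    # writes ./nytprof.out
nytprofhtml -f nytprof.out       # renders an HTML report
```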


    What can be asserted without proof can be dismissed without proof. - Christopher Hitchens
Re: Fast(er) serialization in Perl
by Anonymous Monk on Apr 11, 2010 at 11:07 UTC
    BerkeleyDB or DB_File might be faster; test it out.
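Both modules present an on-disk hash through tie. Here is a runnable sketch using the core SDBM_File module instead (same tie interface; DB_File is essentially a drop-in replacement that handles larger records). File names are made up:

```perl
use strict;
use warnings;
use Fcntl;
use SDBM_File;

# Tie a hash to an on-disk database.
tie my %h, 'SDBM_File', 'demo_db', O_RDWR | O_CREAT, 0666
    or die "Cannot tie: $!";

$h{gene42} = 'annotation';    # written to disk, not kept in RAM
print $h{gene42}, "\n";       # prints "annotation"

untie %h;
```

As with the other disk-based suggestions, there is no big up-front load; each access goes to the file, so this wins when each run needs only some of the keys.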