in reply to Re: Fast(er) serialization in Perl
in thread Fast(er) serialization in Perl

Thanks.
Because most of the program's logic is based on the hashes (and I'm not sure I want to change that just yet), I want to keep the hashes functional, so I'm not sure I can use the array option.
Regarding the web time: because of university guidelines, the web-based part is actually in PHP :(.
However, it is not the bottleneck that needs fixing most urgently.
The question is: can I keep my giant hashes and still save time (mostly on loading them)?

Re^3: Fast(er) serialization in Perl
by The Perlman (Scribe) on Apr 11, 2010 at 12:41 UTC
    >I want to keep the hashes functional

    As I said further down, you can use Tie::Hash to keep the interface functional.

    IMHO there are no general solutions faster than Storable! *

    You need to provide more information.

    If your PHP is calling your Perl script, you should check whether you can hold the data structure in memory.

    You may also check whether you're running into RAM problems causing massive swapping.

    I once sped up a program just by transforming a huge hash into a hash of hashes (by splitting the keys in half). Since the system only retrieved the sub-hashes actually needed from disk swap, I got a fantastic speed gain.

    Whether this is transferable to your case is unknown, since you don't provide enough information...

    Footnote: (*)

    From the Storable documentation:

    "SPEED
    The heart of Storable is written in C for decent speed. Extra low-level optimizations have been made when manipulating perl internals, to sacrifice encapsulation for the benefit of greater speed."

    IMHO it's evident that you need to invest some brainpower to achieve further speed gains!
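
    For reference, the basic Storable round trip looks like this (the file name is just a placeholder):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Storable qw(nstore retrieve);

        my %counts = ( 'NM_009309' => 1, 'NM_029777' => 3 );

        nstore \%counts, 'hash.stor';        # serialize once, up front
        my $loaded = retrieve 'hash.stor';   # the expensive load step
        print $loaded->{'NM_029777'}, "\n";  # prints 3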
      This is how the hash basically looks (it goes on for about 6 million lines, one per gene):
      $VAR1 = {
        'microT' => {
          'mmu-miR-704' => {
            'NM_009309' => '1', 'NM_133983' => '1', 'NM_175563' => '1', 'NM_010889' => '1', 'NM_008302' => '1', 'NM_022023' => '1',
            'NM_009567' => '1', 'NM_172938' => '1', 'NM_029777' => '3', 'NM_134189' => '1', 'NM_175025' => '1', 'NM_177327' => '1',
            'NM_026807' => '1', 'NM_178779' => '3', 'NM_010770' => '1', 'NM_031998' => '1', 'NM_145584' => '2', 'NM_207682' => '1',
            'NM_001005525' => '1', 'NM_080853' => '1', 'NM_145519' => '1', 'NM_031249' => '1', 'NM_172923' => '1', 'NM_001008700' => '1',
            'NM_198617' => '1', 'NM_027400' => '1', 'NM_026406' => '2', 'NM_021296' => '2', 'NM_027652' => '1', 'NM_001045530' => '1',
            'NM_018830' => '1', 'NM_025314' => '1', 'NM_009041' => '1', 'NM_026829' => '3', 'NM_026618' => '1', 'NM_027472' => '1',
            'NM_027870' => '1', 'NM_001033239' => '1', 'NM_026348' => '1', 'NM_008223' => '1', 'NM_009595' => '2', 'NM_146094' => '1',
            'NM_144945' => '1', 'NM_019510' => '1', 'NM_001033251' => '1', 'NM_001081213' => '3', 'NM_008031' => '1', 'NM_028719' => '1',
            'NM_133352' => '1', 'NM_008133' => '1', 'NM_008317' => '1', 'NM_021327' => '1', 'NM_178751' => '1', 'NM_010260' => '1',
            'NM_025683' => '1', 'NM_026383' => '1', 'NM_001081367' => '1', 'NM_001033354' => '2', 'NM_026034' => '1', 'NM_173395' => '1',
            'NM_010762' => '1', 'NM_024432' => '1', 'NM_175113' => '1', 'NM_001077425' => '1', 'NM_026374' => '1', 'NM_026655' => '1',
            'NM_177345' => '1', 'NM_027412' => '1', 'NM_183187' => '1', 'NM_016687' => '1', 'NM_175640' => '1', 'NM_007559' => '1',
            'NM_011269' => '1', 'NM_010252' => '1', 'NM_019657' => '1',
      I'm not very familiar with Tie::Hash, so if you can give me a quick heads-up on how I can use it to save time, that would be great.
        > I'm not very familiar with Tie::Hash, so if you can give me a quick heads-up on how I can use it to save time, that would be great.

        Wouldn't it be even greater if you tried to read the detailed docs and told us what you don't understand? ;-)
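
        That said, the skeleton is small. Here is a minimal sketch of a tie class (my own invention, not from the docs) that keeps the hash syntax while storing the values in a plain array, assuming the numeric part of an NM_ key alone identifies a record:

            #!/usr/bin/perl
            use strict;
            use warnings;

            package NMArray;

            # Callers keep using hash syntax, but values live in an array
            # indexed by the numeric part of the key. NOTE: numification
            # drops leading zeros, so NM_009309 and NM_9309 would collide -
            # this must be safe for your accession scheme (see below).
            sub TIEHASH { bless { data => [] }, shift }
            sub _idx    { $_[0] =~ /^NM_(\d+)\z/ ? $1 + 0 : die "unexpected key: $_[0]" }
            sub FETCH   { $_[0]{data}[ _idx($_[1]) ] }
            sub STORE   { $_[0]{data}[ _idx($_[1]) ] = $_[2] }
            sub EXISTS  { defined $_[0]{data}[ _idx($_[1]) ] }
            sub DELETE  { my $i = _idx($_[1]); my $v = $_[0]{data}[$i]; $_[0]{data}[$i] = undef; $v }

            package main;

            tie my %counts, 'NMArray';
            $counts{'NM_009309'} = 1;           # normal hash syntax, array storage
            print $counts{'NM_009309'}, "\n";   # prints 1

        Iterating with keys or each would additionally need FIRSTKEY/NEXTKEY; see perltie for the full interface.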

        Your hash really looks like a perverted array ...

        Try to figure out how many lookups are performed and whether they can be grouped into smaller data structures.

        BTW: If your university prefers PHP but accepts tying up large parts of the RAM (6 million hash entries can easily result in 1 GB or more of memory consumption), something seems terribly wrong...

        Is 'NM_001005525' different from 'NM_1005525'?

        If not, you have a serious bug...

        If yes:

        90% of your keys have 6 digits, so putting this data into an array with 1 million entries seems reasonable... resulting in about 2 MB of memory consumption if you can limit your values to 65536 distinct numbers (it's a counter, isn't it?)
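
        One way to actually get that 2 MB figure is vec() on a plain string - a sketch, with index 9309 standing in for NM_009309:

            #!/usr/bin/perl
            use strict;
            use warnings;

            # One million 16-bit counters packed into a single 2 MB string.
            my $counters = "\0" x (2 * 1_000_000);

            vec($counters, 9309, 16) = 3;          # set the counter for NM_009309
            print vec($counters, 9309, 16), "\n";  # prints 3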

        The other keys have 9 digits, so it seems your genes are coded in groups of 3 digits. All of them start with "001".

        So generally - from what you show - a hash of arrays seems reasonable, where the hash key represents the first 3 digits and the array index the rest.
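
        A sketch of that layout, assuming it is safe to normalize every accession to nine digits by zero-padding (which again hinges on the leading-zero question above):

            #!/usr/bin/perl
            use strict;
            use warnings;

            my %targets;   # e.g. '001' => [ counters indexed by the last 6 digits ]

            sub store_count {
                my ($acc, $count) = @_;
                my ($digits) = $acc =~ /^NM_(\d+)\z/ or die "unexpected key: $acc";
                $digits = sprintf '%09d', $digits;   # normalize to 9 digits
                $targets{ substr $digits, 0, 3 }[ substr $digits, 3 ] = $count;
            }

            sub fetch_count {
                my ($acc) = @_;
                my ($digits) = $acc =~ /^NM_(\d+)\z/ or return;
                $digits = sprintf '%09d', $digits;
                return $targets{ substr $digits, 0, 3 }[ substr $digits, 3 ];
            }

            store_count('NM_001005525', 1);
            print fetch_count('NM_001005525'), "\n";   # prints 1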

Re^3: Fast(er) serialization in Perl
by The Perlman (Scribe) on Apr 11, 2010 at 13:17 UTC
    the web-based part is actually in PHP

    You should benchmark whether Storable is really your bottleneck; starting a non-persistent Perl process takes some time ...
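
    Something as simple as this would show where the time goes (a sketch; the file name is a placeholder):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Time::HiRes qw(gettimeofday tv_interval);
        use Storable qw(retrieve);

        my $t0   = [gettimeofday];
        my $data = retrieve 'hash.stor';   # the suspected bottleneck
        printf "retrieve took %.2f s\n", tv_interval($t0);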

      The retrieval of the hash takes 12 seconds (out of less than 20 seconds overall), so if I can take that number down a bit, I'm happy.
        Try measuring the memory consumption ...
        I also tried storing the data in a DB, but the program then requires many DB calls, which take too much time.

        Long term, it sounds like a "real" DB is the way to go. I've been experimenting with MySQL, and the performance (when you are careful and get the tables optimized for your app) is fantastic. I don't understand the type of queries you are making against this huge hash structure - there must be a lot of queries for the app to take 8 seconds beyond the 12 seconds to load the hash. A "flat" and appropriately indexed SQL DB can be rocket fast - the idea is to push the logic that collects the data for query X into the DB (i.e., get a result set, not data that you assemble into a result from multiple queries).
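
        To illustrate (the schema here is made up: a targets table with source, mirna, accession, and score columns, indexed on (source, mirna)):

            #!/usr/bin/perl
            use strict;
            use warnings;
            use DBI;

            # One query returns the whole result set - no per-gene round trips.
            my $dbh = DBI->connect('dbi:mysql:database=mirna', 'user', 'password',
                                   { RaiseError => 1 });

            my $rows = $dbh->selectall_arrayref(
                'SELECT accession, score FROM targets WHERE source = ? AND mirna = ?',
                {}, 'microT', 'mmu-miR-704',
            );

            for my $row (@$rows) {
                my ($accession, $score) = @$row;
                print "$accession\t$score\n";
            }

            $dbh->disconnect;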

        Books that I would recommend are:
        Learning SQL by Alan Beaulieu
        MySQL in a Nutshell by Russell Dyer (also has description of Perl DBI and PHP I/F)

        Your hash tables are huge. As a possible intermediate step, you could write a Perl server that is initialized with these huge hash tables. Have clients connect to it and ask questions that translate very directly into hash table queries. That would save the 12 seconds of loading the hash tables on every run. You don't mention how many clients could be connected to such an app, but it could be that a single process, handling a queue of requests one at a time, would be just fine - doing better than 12 seconds ought to be easy. Other solutions like FastCGI or mod_perl are good too, but I worry about running your machine out of memory and disk thrashing.
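
        A bare-bones version of such a server might look like this (a sketch: the port, the file name, and the one-line "source mirna accession" protocol are all made up; the PHP side could talk to it with fsockopen):

            #!/usr/bin/perl
            use strict;
            use warnings;
            use IO::Socket::INET;
            use Storable qw(retrieve);

            # Pay the 12-second load exactly once, at server startup.
            my $data = retrieve 'hash.stor';

            my $server = IO::Socket::INET->new(
                LocalPort => 9000,
                Listen    => 5,
                ReuseAddr => 1,
            ) or die "cannot listen: $!";

            # One request per line: "source mirna accession" -> score.
            while (my $client = $server->accept) {
                while (my $line = <$client>) {
                    chomp $line;
                    my ($source, $mirna, $acc) = split ' ', $line;
                    my $score = $data->{$source}{$mirna}{$acc};
                    my $reply = defined $score ? "$score\n" : "NOT_FOUND\n";
                    print $client $reply;
                }
                close $client;
            }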

        Could you give an example of the type of query that you are running against this hash table structure?