in thread Fast(er) serialization in Perl

This is basically how the hash looks (it goes on for about 6 million lines, one gene per line):
$VAR1 = {
  'microT' => {
    'mmu-miR-704' => {
      'NM_009309' => '1',
      'NM_133983' => '1',
      'NM_175563' => '1',
      'NM_010889' => '1',
      'NM_008302' => '1',
      'NM_022023' => '1',
      'NM_009567' => '1',
      'NM_172938' => '1',
      'NM_029777' => '3',
      'NM_134189' => '1',
      'NM_175025' => '1',
      'NM_177327' => '1',
      'NM_026807' => '1',
      'NM_178779' => '3',
      'NM_010770' => '1',
      'NM_031998' => '1',
      'NM_145584' => '2',
      'NM_207682' => '1',
      'NM_001005525' => '1',
      'NM_080853' => '1',
      'NM_145519' => '1',
      'NM_031249' => '1',
      'NM_172923' => '1',
      'NM_001008700' => '1',
      'NM_198617' => '1',
      'NM_027400' => '1',
      'NM_026406' => '2',
      'NM_021296' => '2',
      'NM_027652' => '1',
      'NM_001045530' => '1',
      'NM_018830' => '1',
      'NM_025314' => '1',
      'NM_009041' => '1',
      'NM_026829' => '3',
      'NM_026618' => '1',
      'NM_027472' => '1',
      'NM_027870' => '1',
      'NM_001033239' => '1',
      'NM_026348' => '1',
      'NM_008223' => '1',
      'NM_009595' => '2',
      'NM_146094' => '1',
      'NM_144945' => '1',
      'NM_019510' => '1',
      'NM_001033251' => '1',
      'NM_001081213' => '3',
      'NM_008031' => '1',
      'NM_028719' => '1',
      'NM_133352' => '1',
      'NM_008133' => '1',
      'NM_008317' => '1',
      'NM_021327' => '1',
      'NM_178751' => '1',
      'NM_010260' => '1',
      'NM_025683' => '1',
      'NM_026383' => '1',
      'NM_001081367' => '1',
      'NM_001033354' => '2',
      'NM_026034' => '1',
      'NM_173395' => '1',
      'NM_010762' => '1',
      'NM_024432' => '1',
      'NM_175113' => '1',
      'NM_001077425' => '1',
      'NM_026374' => '1',
      'NM_026655' => '1',
      'NM_177345' => '1',
      'NM_027412' => '1',
      'NM_183187' => '1',
      'NM_016687' => '1',
      'NM_175640' => '1',
      'NM_007559' => '1',
      'NM_011269' => '1',
      'NM_010252' => '1',
      'NM_019657' => '1',
I'm not very familiar with Tie::Hash, so if you can give me a quick heads up on how I can use it to save time it would be great.

Re^5: Fast(er) serialization in Perl
by The Perlman (Scribe) on Apr 11, 2010 at 13:47 UTC
    > I'm not very familiar with Tie::Hash, so if you can give me a quick heads up on how I can use it to save time it would be great

    Wouldn't it be even greater if you tried reading the detailed docs and told us what you don't understand? ;-)

    Your hash really looks like a perverted array ...

    Try to figure out how many lookups are performed and whether they can be grouped into smaller data structures.
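One way to measure the lookups is to tie the hash temporarily to a class that counts reads. The following is only a sketch: `CountingHash` and `fetch_counts` are hypothetical names, the three sample entries come from the dump above, and the loop stands in for the real workload. It relies on `Tie::StdHash`, which ships with the core `Tie::Hash` module.

```perl
use strict;
use warnings;

# Hypothetical helper class: a tied hash that counts read accesses per key.
package CountingHash;
use Tie::Hash;                       # core module; also provides Tie::StdHash
our @ISA = ('Tie::StdHash');

my %fetches;                         # key => number of FETCH calls

sub FETCH {
    my ($self, $key) = @_;
    $fetches{$key}++;
    return $self->{$key};            # Tie::StdHash stores its data in a plain hash ref
}

sub fetch_counts { return \%fetches }

package main;

tie my %targets, 'CountingHash';
%targets = ('NM_009309' => 1, 'NM_029777' => 3, 'NM_145584' => 2);

# Stand-in workload: the real script's lookups would go here.
my $sum = 0;
$sum += $targets{$_} for qw(NM_029777 NM_029777 NM_145584);

# Report how often each key was actually read.
my $counts = CountingHash->fetch_counts;
printf "%-12s %d\n", $_, $counts->{$_} for sort keys %$counts;
```

Only the tied top-level hash is counted here; for a nested structure like the one shown, each inner hash would need to be tied separately. Tied access is noticeably slower than a plain hash, so this is meant for profiling, not production.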

    BTW: If your university prefers PHP but accepts tying up large parts of the RAM (6 million hash entries can easily consume 1 GB or more), something seems terribly wrong...

Re^5: Fast(er) serialization in Perl
by The Perlman (Scribe) on Apr 11, 2010 at 14:11 UTC
    Is 'NM_001005525' different from 'NM_1005525'?

    If not, you have a serious bug...

    If yes:

    About 90% of the keys you show have 6 digits, so putting this data into an array with 1 million entries seems reasonable. That comes to about 2 MB of memory if you can limit your values to 65536 distinct numbers, i.e. 16 bits per entry (it's a counter, isn't it?)
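A regular Perl array of a million scalars takes considerably more than 2 MB, so reaching that figure means packing the counters into a single string. Here is a minimal sketch using the built-in `vec`, assuming 16-bit counters and 6-digit IDs only; `id_to_slot`, `set_count` and `get_count` are hypothetical helper names.

```perl
use strict;
use warnings;

my $SLOTS = 1_000_000;               # one slot per possible 6-digit ID
my $BITS  = 16;                      # counter values 0 .. 65535

# Pre-size the buffer: 1,000,000 slots * 2 bytes = 2,000,000 bytes.
my $counts = "\0" x ($SLOTS * $BITS / 8);

# Hypothetical helper: 'NM_009309' -> 9309 (6-digit IDs only).
sub id_to_slot {
    my ($id) = @_;
    my ($digits) = $id =~ /\ANM_(\d{6})\z/
        or die "not a 6-digit NM_ id: $id";
    return 0 + $digits;
}

sub set_count { my ($id, $n) = @_; vec($counts, id_to_slot($id), $BITS) = $n }
sub get_count { my ($id) = @_;     return vec($counts, id_to_slot($id), $BITS) }

set_count('NM_029777', 3);
set_count('NM_145584', 2);
print get_count('NM_029777'), "\n";  # the value stored above
print length($counts), " bytes\n";   # size of the packed buffer
```

An unset slot reads back as 0, so this only works if 0 can mean "no entry". The 9-digit IDs would need their own buffer or a separate structure.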

    The other keys have 9 digits, so it seems you are coding your genes in groups of 3 digits. All of them start with "001".

    So generally - from what you show - a hash of arrays seems reasonable, where the hash key represents the first 3 digits and the array index the rest.
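The hash-of-arrays layout could look roughly like this. It is a sketch under assumptions: `split_id`, `set_count` and `get_count` are hypothetical helpers, and 6-digit IDs are filed under the empty prefix '' so that they can never collide with a 9-digit ID.

```perl
use strict;
use warnings;

my %by_prefix;    # prefix => array ref indexed by the last 6 digits

# Hypothetical helper: 'NM_001005525' -> ('001', 5525), 'NM_029777' -> ('', 29777).
sub split_id {
    my ($id) = @_;
    my ($prefix, $rest) = $id =~ /\ANM_(\d{3})?(\d{6})\z/
        or die "unexpected id format: $id";
    return (defined $prefix ? $prefix : '', 0 + $rest);
}

sub set_count {
    my ($id, $n) = @_;
    my ($prefix, $index) = split_id($id);
    $by_prefix{$prefix}[$index] = $n;
}

sub get_count {
    my ($id) = @_;
    my ($prefix, $index) = split_id($id);
    return undef unless exists $by_prefix{$prefix};   # avoid autovivifying a prefix
    return $by_prefix{$prefix}[$index];
}

set_count('NM_001005525', 1);
set_count('NM_029777',    3);
print get_count('NM_001005525'), "\n";
print get_count('NM_029777'), "\n";
```

Each inner array is allocated up to its highest used index, so a prefix with a few scattered IDs still reserves a slot for every index below the largest one.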

      OK, I understand what you are saying.
      I will try this direction, and hope I manage to save some time.
      Thanks for your help
        BTW: Do you use this hash data read-only? If not, you should take care of simultaneous calls of your script. And if you're only accessing a relatively "small" number of entries, better choose one of the flat-file DB solutions mentioned above.
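As an illustration of the flat-file route, here is a minimal sketch that ties a hash to an on-disk DBM file with `SDBM_File`, which is bundled with Perl. The file path and the `|`-joined composite keys are assumptions made for the example (a DBM file stores flat key/value pairs, so the nested hash has to be flattened somehow), and a temporary directory is used only to keep the sketch self-contained.

```perl
use strict;
use warnings;
use Fcntl;                           # O_RDWR, O_CREAT, O_RDONLY
use SDBM_File;                       # DBM implementation bundled with Perl
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/targets";           # hypothetical path; SDBM adds its own suffixes

# One-time build step: flatten the nested hash into composite keys.
{
    tie(my %db, 'SDBM_File', $file, O_RDWR | O_CREAT, 0666)
        or die "cannot create $file: $!";
    $db{'microT|mmu-miR-704|NM_029777'} = 3;
    $db{'microT|mmu-miR-704|NM_145584'} = 2;
    untie %db;
}

# Every later run: open read-only and fetch just the entries needed,
# instead of deserializing all 6 million of them.
tie(my %db, 'SDBM_File', $file, O_RDONLY, 0666)
    or die "cannot open $file: $!";
my $hits = $db{'microT|mmu-miR-704|NM_029777'};
print "hits: $hits\n";
untie %db;
```

SDBM does no locking of its own. Concurrent read-only access is fine, but if the script also writes, the writers would have to serialize themselves, for example with `flock` on a separate lock file.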