in reply to Re: Fast(er) serialization in Perl
in thread Fast(er) serialization in Perl

Thanks.
Because most of the program's logic is based on the hashes (and I'm not sure I want to change that just yet), I want to keep the hashes functional, so I'm not sure I can use the array option.
Regarding the web time: because of university guidelines, the web-based part is actually in PHP :(.
However, it is not the bottleneck that needs fixing most urgently.
The question is: can I keep my giant hashes and still save time (mostly on loading them)?

Re^3: Fast(er) serialization in Perl
by The Perlman (Scribe) on Apr 11, 2010 at 12:41 UTC
    >I want to keep the hashes functional

    As I said further down, you can use Tie::Hash to keep the interface functional.

    IMHO there are no general solutions faster than Storable! *

    You need to provide more information.

    If your PHP is calling your Perl script, you should check whether you can hold the data structure in memory.

    You may also check whether you're running into RAM problems causing massive swapping.

    I once sped up a program just by transforming a huge hash into a hash of hashes (by splitting the keys in half). Since the system only retrieved the sub-hashes actually needed from disk swap, I got a fantastic speed gain.

    Whether this is transferable to your case is unknown, since you don't provide enough information...

    Footnote: (*)

    From the Storable documentation:

    "SPEED
    The heart of Storable is written in C for decent speed. Extra low-level optimizations have been made when manipulating perl internals, to sacrifice encapsulation for the benefit of greater speed."

    IMHO it's evident that you need to invest some brainpower to achieve further speed gains!
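
    For reference, the basic Storable round trip looks like this (the file name is just a placeholder):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Storable qw(nstore retrieve);

        my %counts = ( 'NM_009309' => 1, 'NM_029777' => 3 );

        nstore \%counts, 'hash.stor';        # serialize once, up front
        my $loaded = retrieve 'hash.stor';   # the expensive load step
        print $loaded->{'NM_029777'}, "\n";  # prints 3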
      This is how the hash basically looks (it goes on for about 6 million lines, one per gene):
      $VAR1 = {
        'microT' => {
          'mmu-miR-704' => {
            'NM_009309' => '1', 'NM_133983' => '1', 'NM_175563' => '1', 'NM_010889' => '1', 'NM_008302' => '1', 'NM_022023' => '1',
            'NM_009567' => '1', 'NM_172938' => '1', 'NM_029777' => '3', 'NM_134189' => '1', 'NM_175025' => '1', 'NM_177327' => '1',
            'NM_026807' => '1', 'NM_178779' => '3', 'NM_010770' => '1', 'NM_031998' => '1', 'NM_145584' => '2', 'NM_207682' => '1',
            'NM_001005525' => '1', 'NM_080853' => '1', 'NM_145519' => '1', 'NM_031249' => '1', 'NM_172923' => '1', 'NM_001008700' => '1',
            'NM_198617' => '1', 'NM_027400' => '1', 'NM_026406' => '2', 'NM_021296' => '2', 'NM_027652' => '1', 'NM_001045530' => '1',
            'NM_018830' => '1', 'NM_025314' => '1', 'NM_009041' => '1', 'NM_026829' => '3', 'NM_026618' => '1', 'NM_027472' => '1',
            'NM_027870' => '1', 'NM_001033239' => '1', 'NM_026348' => '1', 'NM_008223' => '1', 'NM_009595' => '2', 'NM_146094' => '1',
            'NM_144945' => '1', 'NM_019510' => '1', 'NM_001033251' => '1', 'NM_001081213' => '3', 'NM_008031' => '1', 'NM_028719' => '1',
            'NM_133352' => '1', 'NM_008133' => '1', 'NM_008317' => '1', 'NM_021327' => '1', 'NM_178751' => '1', 'NM_010260' => '1',
            'NM_025683' => '1', 'NM_026383' => '1', 'NM_001081367' => '1', 'NM_001033354' => '2', 'NM_026034' => '1', 'NM_173395' => '1',
            'NM_010762' => '1', 'NM_024432' => '1', 'NM_175113' => '1', 'NM_001077425' => '1', 'NM_026374' => '1', 'NM_026655' => '1',
            'NM_177345' => '1', 'NM_027412' => '1', 'NM_183187' => '1', 'NM_016687' => '1', 'NM_175640' => '1', 'NM_007559' => '1',
            'NM_011269' => '1', 'NM_010252' => '1', 'NM_019657' => '1',
      I'm not very familiar with Tie::Hash, so if you can give me a quick heads-up on how I can use it to save time, that would be great.
        > I'm not very familiar with Tie::Hash, so if you can give me a quick heads-up on how I can use it to save time, that would be great.

        Wouldn't it be even greater if you tried to read the detailed docs and told us what you don't understand? ;-)
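
        That said, the skeleton is small. Here is a minimal sketch of a tie class (my own invention, not from the docs) that keeps the hash syntax while storing the values in a plain array, assuming the numeric part of an NM_ key alone identifies a record:

            #!/usr/bin/perl
            use strict;
            use warnings;

            package NMArray;

            # Callers keep using hash syntax, but values live in an array
            # indexed by the numeric part of the key. NOTE: numification
            # drops leading zeros, so NM_009309 and NM_9309 would collide -
            # this must be safe for your accession scheme (see below).
            sub TIEHASH { bless { data => [] }, shift }
            sub _idx    { $_[0] =~ /^NM_(\d+)\z/ ? $1 + 0 : die "unexpected key: $_[0]" }
            sub FETCH   { $_[0]{data}[ _idx($_[1]) ] }
            sub STORE   { $_[0]{data}[ _idx($_[1]) ] = $_[2] }
            sub EXISTS  { defined $_[0]{data}[ _idx($_[1]) ] }
            sub DELETE  { my $i = _idx($_[1]); my $v = $_[0]{data}[$i]; $_[0]{data}[$i] = undef; $v }

            package main;

            tie my %counts, 'NMArray';
            $counts{'NM_009309'} = 1;           # normal hash syntax, array storage
            print $counts{'NM_009309'}, "\n";   # prints 1

        Iterating with keys or each would additionally need FIRSTKEY/NEXTKEY; see perltie for the full interface.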

        Your hash really looks like a perverted array ...

        Try to figure out how many lookups are performed and whether they can be grouped into smaller data structures.

        BTW: If your university prefers PHP but accepts tying up large parts of the RAM (6 million hash entries can easily result in 1 GB or more of memory consumption), something seems terribly wrong...

        Is 'NM_001005525' different from 'NM_1005525'?

        If not, you have a serious bug...

        If yes:

        90% of your keys have 6 digits, so putting this data into an array with 1 million entries seems reasonable... resulting in about 2 MB of memory consumption if you can limit your values to 65536 distinct numbers (it's a counter, isn't it?)
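
        One way to actually get that 2 MB figure is vec() on a plain string - a sketch, with index 9309 standing in for NM_009309:

            #!/usr/bin/perl
            use strict;
            use warnings;

            # One million 16-bit counters packed into a single 2 MB string.
            my $counters = "\0" x (2 * 1_000_000);

            vec($counters, 9309, 16) = 3;          # set the counter for NM_009309
            print vec($counters, 9309, 16), "\n";  # prints 3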

        The other keys have 9 digits, so it seems your genes are coded in groups of 3 digits. All of them start with "001".

        So generally - from what you show - a hash of arrays seems reasonable, where the hash key represents the first 3 digits and the array index the rest.
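
        A sketch of that layout, assuming it is safe to normalize every accession to nine digits by zero-padding (which again hinges on the leading-zero question above):

            #!/usr/bin/perl
            use strict;
            use warnings;

            my %targets;   # e.g. '001' => [ counters indexed by the last 6 digits ]

            sub store_count {
                my ($acc, $count) = @_;
                my ($digits) = $acc =~ /^NM_(\d+)\z/ or die "unexpected key: $acc";
                $digits = sprintf '%09d', $digits;   # normalize to 9 digits
                $targets{ substr $digits, 0, 3 }[ substr $digits, 3 ] = $count;
            }

            sub fetch_count {
                my ($acc) = @_;
                my ($digits) = $acc =~ /^NM_(\d+)\z/ or return;
                $digits = sprintf '%09d', $digits;
                return $targets{ substr $digits, 0, 3 }[ substr $digits, 3 ];
            }

            store_count('NM_001005525', 1);
            print fetch_count('NM_001005525'), "\n";   # prints 1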

Re^3: Fast(er) serialization in Perl
by The Perlman (Scribe) on Apr 11, 2010 at 13:17 UTC
    the web-based part is actually in PHP

    You should benchmark whether Storable is really your bottleneck; starting a non-persistent Perl process takes some time ...
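
    Something as simple as this would show where the time goes (a sketch; the file name is a placeholder):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Time::HiRes qw(gettimeofday tv_interval);
        use Storable qw(retrieve);

        my $t0   = [gettimeofday];
        my $data = retrieve 'hash.stor';   # the suspected bottleneck
        printf "retrieve took %.2f s\n", tv_interval($t0);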

      The retrieval of the hash takes 12 seconds (out of less than 20 seconds overall), so if I can take that number down a bit, I'm happy.
        Try measuring the memory consumption ...
        I also tried storing the data in a DB, but the program then requires many DB calls, which take too much time.

        Long term, it sounds like a "real" DB is the way to go. I've been experimenting with MySQL, and the performance (when you are careful and get the tables optimized for your app) is fantastic. I don't understand the type of queries you are making against this huge hash structure - there must be a lot of queries for the app to take 8 seconds beyond the 12 seconds to load the hash. A "flat" and appropriately indexed SQL DB can be rocket fast - the idea is to push the logic that collects the data for query X into the DB (i.e., get a result set, not data that you assemble into a result from multiple queries).
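
        To illustrate (the schema here is made up: a targets table with source, mirna, accession, and score columns, indexed on (source, mirna)):

            #!/usr/bin/perl
            use strict;
            use warnings;
            use DBI;

            # One query returns the whole result set - no per-gene round trips.
            my $dbh = DBI->connect('dbi:mysql:database=mirna', 'user', 'password',
                                   { RaiseError => 1 });

            my $rows = $dbh->selectall_arrayref(
                'SELECT accession, score FROM targets WHERE source = ? AND mirna = ?',
                {}, 'microT', 'mmu-miR-704',
            );

            for my $row (@$rows) {
                my ($accession, $score) = @$row;
                print "$accession\t$score\n";
            }

            $dbh->disconnect;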

        Books that I would recommend are:
        Learning SQL by Alan Beaulieu
        MySQL in a Nutshell by Russell Dyer (also has description of Perl DBI and PHP I/F)

        Your hash tables are huge. As a possible intermediate step, you could write a Perl server that is initialized with these huge hash tables. Have clients connect to it and ask questions that translate very directly into hash table queries. That would save the 12 seconds of loading the hash tables on every run. You don't mention how many clients could be connected to such an app, but it could be that a single process, handling a queue of requests one at a time, would be just fine - doing better than 12 seconds ought to be easy. Other solutions like FastCGI or mod_perl are good too, but I worry about running your machine out of memory and disk thrashing.
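
        A bare-bones version of such a server might look like this (a sketch: the port, the file name, and the one-line "source mirna accession" protocol are all made up; the PHP side could talk to it with fsockopen):

            #!/usr/bin/perl
            use strict;
            use warnings;
            use IO::Socket::INET;
            use Storable qw(retrieve);

            # Pay the 12-second load exactly once, at server startup.
            my $data = retrieve 'hash.stor';

            my $server = IO::Socket::INET->new(
                LocalPort => 9000,
                Listen    => 5,
                ReuseAddr => 1,
            ) or die "cannot listen: $!";

            # One request per line: "source mirna accession" -> score.
            while (my $client = $server->accept) {
                while (my $line = <$client>) {
                    chomp $line;
                    my ($source, $mirna, $acc) = split ' ', $line;
                    my $score = $data->{$source}{$mirna}{$acc};
                    my $reply = defined $score ? "$score\n" : "NOT_FOUND\n";
                    print $client $reply;
                }
                close $client;
            }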

        Could you give an example of the type of query that you are running against this hash table structure?