However, A-SET is fairly small - about 6-7MB total, with the largest file being about 1MB and 70,000 lines.
For each line in A-SET I do some processing and store information in the following hash:
    'MM-001*' => {
        '556*' => [
            [ ['10',18], ['12',2], ['9',2], ['0',2] ],
            { '11*' => 1, '3*' => 0, '16*' => 2, '2*' => 3 }
        ],
        # ... etc. for nonconsecutive keys like '567'
    },
    # ... etc. for nonconsecutive keys like 'NN-003'
I need hashes because I need to quickly access info for a key combination like e.g. MM-001*/556*/3* (I marked the entries that are needed as hash keys with asterisks at the end).
This is just an example entry - the length of the content varies, and this is about the average length of the value part. So Tie::Hash is probably not easily applicable here.
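In rough code terms, building one such entry and doing the lookup I need looks something like this (just a sketch - %LONG_HASH and the literal keys are only illustrative):

    use strict;
    use warnings;

    my %LONG_HASH;

    # one entry: primary key -> secondary key -> [ list of pairs, hash of counts ]
    $LONG_HASH{'MM-001*'}{'556*'} = [
        [ ['10',18], ['12',2], ['9',2], ['0',2] ],         # array of [value, count] pairs
        { '11*' => 1, '3*' => 0, '16*' => 2, '2*' => 3 },  # tertiary keys with counts
    ];

    # quick lookup of a combination like MM-001*/556*/3*
    my $val = $LONG_HASH{'MM-001*'}{'556*'}[1]{'3*'};      # 0 in this example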
Data from all the files in A-SET is slurped in and stored in one LONG_HASH.
Then I planned to loop over the files in B-SET, reading them in one at a time, and for each entry in that LONG_HASH do a different type of processing on that file (depending on the precise values in the hash example above).
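Schematically, the whole flow is something like this (a sketch only - parse_a_line() and process_b_file() stand in for the real per-format logic):

    # slurp every A-SET file into one big hash
    my %LONG_HASH;
    for my $a_file (@a_set_files) {
        open my $fh, '<', $a_file or die "cannot open $a_file: $!";
        while ( my $line = <$fh> ) {
            # hypothetical parser returning the two key levels plus the value parts
            my ( $pk, $sk, $pairs, $counts ) = parse_a_line($line);
            $LONG_HASH{$pk}{$sk} = [ $pairs, $counts ];
        }
        close $fh;
    }

    # then walk B-SET one file at a time against the in-memory LONG_HASH
    for my $b_file (@b_set_files) {
        process_b_file( $b_file, \%LONG_HASH );
    }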
What puzzles me is that I can tackle one family of A-SET files, with a combined size of about 3MB, without a problem - the memory spike never exceeds 150MB. And none of those files is longer than 20,000 lines.
But when I try to run the same script against the second family of A-SET files, memory explodes when parsing the longer files - the ones with about 60,000 lines.
Memory suddenly explodes to about 1.5GB or more.
But the file is a mere 1MB in size... less than the combined size of the other family of A-SETs.
The only difference I see is the granularity of the data - the first family of A-SETs has a relatively small set of primary keys (e.g. 'MM-001*') - fewer than 300 - and longer (500+) sets of secondary keys (e.g. '556*').
The second family of A-SETs has a large (2000+?) set of primary keys and fairly short sets of secondary keys (10-30).
It seems that memory explodes when LONG_HASH accumulates about 300 primary keys.
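If it would help, I could instrument the build loop with something like Devel::Size (assuming that module is available here) to see how big the structure really gets per primary key:

    use Devel::Size qw(total_size);

    # after loading, report overall size and the size of one primary-key subtree
    printf "LONG_HASH: %d bytes across %d primary keys\n",
        total_size(\%LONG_HASH), scalar keys %LONG_HASH;
    printf "one entry: %d bytes\n", total_size( $LONG_HASH{'MM-001*'} );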
I would appreciate any wisdom on how to tackle this situation. Performance (total running time of this code) is not super-important - but experiments on smaller families of A-SETs and B-SETs show that the logic involved can take 20+ minutes (even with both A- and B-SETs in memory).
If I have to do I/O for each combination of sets from A and B, that will be close to 6000 file read-ins, parses, uses and purges, which may add up to a lot of extra time.
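That fallback would look roughly like this (again just a sketch; read_a_family() is a placeholder for re-parsing one A-SET family from disk):

    # worst case: re-read and purge A-SET data for every A/B combination
    for my $b_file (@b_set_files) {
        for my $a_family (@a_set_families) {
            my %partial_hash = read_a_family($a_family);   # parse from disk again
            process_b_file( $b_file, \%partial_hash );
            %partial_hash = ();                            # purge before the next pair
        }
    }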
Thanks for any pointers/info or critique,
JT