woland99 has asked for the wisdom of the Perl Monks concerning the following question:

Howdy - I am trying to process two sets of moderately large data files - information in one set of files (A-SET) has to be checked against information in the other set (B-SET).
For smaller files I would read both sets into memory, but the second set is too large - its combined size is about 0.5GB.

However, A-SET is fairly small - about 6-7MB total, with the largest file being about 1MB and 70,000 lines.
For each line in A-SET I do some processing and store information in a hash structured like the following:

    'MM-001*' => {
        '556*' => [
            [ ['10',18], ['12',2], ['9',2], ['0',2] ],
            { '11*' => 1, '3*' => 0, '16*' => 2, '2*' => 3 },
        ],
        # etc. for nonconsecutive secondary keys like '567'
    },
    # etc. for nonconsecutive primary keys like 'NN-003'

I need hashes because I need to quickly access info for a key combination like, e.g., MM-001*/556*/3* (I marked the entries that serve as hash keys with asterisks at the end).
This is just an example entry - the length of the content may vary, and this is about the average length of the value part. So Tie::Hash is probably not very easily applicable here.
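For illustration, looking up a combination like MM-001*/556*/3* against the structure above is just a chain of nested subscripts (the name %LONG_HASH is only a placeholder):

    # count stored for secondary key '3*' under MM-001*/556*
    my $count = $LONG_HASH{'MM-001*'}{'556*'}[1]{'3*'};    # 0 in the example above

    # list of pairs stored under MM-001*/556*
    my $pairs = $LONG_HASH{'MM-001*'}{'556*'}[0];          # [ ['10',18], ['12',2], ... ]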

Data from all files in A-SET is to be slurped in and stored in one LONG_HASH.
Then I planned to loop over the files in B-SET, reading them in one at a time, and for each entry in that LONG_HASH do a different type of processing on that file (depending on the precise values in the hash example above).
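Roughly, the plan looks like this (just a sketch - the file list, the processing sub, and its arguments are placeholders):

    # %LONG_HASH is already built from A-SET at this point
    for my $b_file (@b_set_files) {
        open my $fh, '<', $b_file or die "Cannot open $b_file: $!";
        my @b_lines = <$fh>;    # only one B-SET file in memory at a time
        close $fh;

        for my $primary (keys %LONG_HASH) {
            # choose the type of processing based on what is stored under this key
            process_b_file( \@b_lines, $primary, $LONG_HASH{$primary} );
        }
    }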

What puzzles me is that I can tackle one family of A-SET files, of combined size of about 3MB, without a problem - the memory spike never exceeds 150MB. And none of those files is longer than 20,000 lines.
But when I try to run the same script against the second family of A-SET files, memory explodes when parsing the longer files - those with about 60,000 lines.
Memory suddenly jumps to about 1.5GB or more.
But the file is a mere 1MB in size... less than the combined size of the other family of A-SETs.

The only difference I see is the granularity of the data - the first family of A-SETs has a relatively small set of primary keys (e.g. 'MM-001*' etc.) - fewer than 300 - and longer (500+) sets of secondary keys (e.g. '556*'),
while the second family of A-SETs has a large (2000+?) set of primary keys and fairly short sets of secondary keys (10-30).

It seems that memory explodes when LONG_HASH accumulates about 300 primary keys.
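To pin down where the memory actually goes, I could measure the structure as it grows, e.g. with Devel::Size (just a diagnostic sketch):

    use Devel::Size qw(total_size);

    # after slurping each A-SET file (or every N primary keys added):
    printf "primary keys: %d, LONG_HASH: %d bytes\n",
        scalar( keys %LONG_HASH ), total_size( \%LONG_HASH );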

I would appreciate any wisdom on how to tackle this situation. Performance (total running time of this code) is not super-important - but experiments on smaller families of A-SETs and B-SETs show that the logic involved can take 20+ minutes (even if both A- and B-SETs are in memory).
If I have to do I/O for each combination of sets from both A and B, that will be close to 6000 rounds of files being read in, parsed, used, then purged, which may add up to a lot of extra time.

Thanks for any pointers/info or critique,

JT

Replies are listed 'Best First'.
Re: running out of memory when slurping in data
by MidLifeXis (Monsignor) on Nov 18, 2011 at 14:00 UTC

    Can you provide any code? There may be something in your algorithm that is causing some issues.

    --MidLifeXis

Re: running out of memory when slurping in data
by woland99 (Beadle) on Nov 18, 2011 at 14:16 UTC
    Never mind - I think I found the bug - one of the files in A-SET had a subtle syntax error that my homemade parsing logic did not catch, and it was causing a never-ending loop. I would still appreciate any pointers on effectively storing data that has a non-uniform nested structure.

    JT