Howdy - I am trying to process two sets of moderately large data files; information in one set of files (A-SET) has to be checked against information in the other set (B-SET).
For smaller files I would read both sets into memory, but the second set is too large - its combined size is about 0.5GB.

However, A-SET is fairly small - about 6-7MB total, with the largest file being about 1MB and 70,000 lines.
For each line in A-SET I do some processing and store information in the following hash:

    'MM-001*' => {
        '556*' => [
            [ ['10',18], ['12',2], ['9',2], ['0',2] ],
            { '11*' => 1, '3*' => 0, '16*' => 2, '2*' => 3 },
        ],
        # etc. for nonconsecutive keys like '567' ...
    },
    # etc. for nonconsecutive keys like 'NN-003'

I need hashes because I need quick access to the info for a key combination such as MM-001*/556*/3* (I marked the entries I need as hash keys with asterisks at the end).
This is just an example entry - the length of the value part varies, and the one above is about average. So Tie::Hash is probably not easily applicable here.
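
For illustration, the lookup I have in mind looks roughly like this - a minimal sketch only, where %LONG_HASH is the hash described below and the [0]/[1] indices follow the structure of the example entry above:

    my $entry = $LONG_HASH{'MM-001*'}{'556*'};
    my $pairs = $entry->[0];          # the list of pairs, e.g. ['10',18]
    my $count = $entry->[1]{'3*'};    # value stored under the third-level key '3*'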

Data from all files in A-SET is slurped in and stored in one LONG_HASH.
Then I planned to loop over the files in B-SET, reading them in one at a time, and for each entry in LONG_HASH do a different type of processing on that file (depending on the precise values in the hash example above). A rough sketch of that flow follows.
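
In outline (a sketch only - the file globs and the parse step are made-up placeholders, not code from my actual script):

    use strict;
    use warnings;

    my %LONG_HASH;

    # Phase 1: slurp every A-SET file into %LONG_HASH
    for my $a_file (glob 'A-SET/*') {
        open my $fh, '<', $a_file or die "Can't open $a_file: $!";
        while (my $line = <$fh>) {
            # placeholder for whatever builds the nested structure shown above
            # parse_a_line($line, \%LONG_HASH);
        }
        close $fh;
    }

    # Phase 2: read B-SET files one at a time and check them against %LONG_HASH
    for my $b_file (glob 'B-SET/*') {
        open my $fh, '<', $b_file or die "Can't open $b_file: $!";
        while (my $line = <$fh>) {
            # processing here depends on the values stored in %LONG_HASH
        }
        close $fh;
    }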

What puzzles me is that I can handle one family of A-SET files, with a combined size of about 3MB, without any problem - the memory spike never exceeds 150MB, and none of those files is longer than 20,000 lines.
But when I run the same script against the second family of A-SET files, memory explodes while parsing the longer files - those with about 60,000 lines.
Memory suddenly jumps to about 1.5GB or more.
Yet the file is a mere 1MB in size... less than the combined size of the other family of A-SETs.

The only difference I see is the granularity of the data: the first family of A-SETs has a relatively small set of primary keys (e.g. 'MM-001*') - fewer than 300 - and longer (500+) sets of secondary keys (e.g. '556*').
The second family of A-SETs has a large set of primary keys (2000+?) and fairly short sets of secondary keys (10-30).

It seems that memory explodes when LONG_HASH accumulates about 300 primary keys.
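
To put numbers on that, this is the kind of check I could run - a minimal sketch assuming Devel::Size from CPAN is available, with key counts that just mimic the two families described above:

    use strict;
    use warnings;
    use Devel::Size qw(total_size);   # CPAN module, assumed available

    # Family 1: few primary keys, many secondary keys per primary
    my %few_primaries;
    for my $p (1 .. 300) {
        $few_primaries{"MM-$p*"}{"$_*"} = 1 for 1 .. 500;
    }

    # Family 2: many primary keys, few secondary keys per primary
    my %many_primaries;
    for my $p (1 .. 2000) {
        $many_primaries{"NN-$p*"}{"$_*"} = 1 for 1 .. 20;
    }

    printf "few primaries / many secondaries: %d bytes\n", total_size(\%few_primaries);
    printf "many primaries / few secondaries: %d bytes\n", total_size(\%many_primaries);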

I would appreciate any wisdom on how to tackle this situation. Performance (total running time of this code) is not super-important, but experiments on smaller families of A-SETs and B-SETs show that the logic involved can take 20+ minutes even when both the A- and B-SETs are in memory.
If I have to do I/O for each combination of sets from A and B, that comes to close to 6,000 rounds of reading a file in, parsing it, using it, and purging it - which may add up to a lot of extra time.

Thanks for any pointers/info or critique,

JT
