dnquark has asked for the wisdom of the Perl Monks concerning the following question:

I've been processing ~200 meg files of ASCII records by reading them into a hash of array references, and then writing each @{ $hash{$key} } record into its own file. After the script munges a couple of files, it ends up consuming all available memory and slows the machine down to a crawl.

I start with a list of (*.gz) file names, for every file name I call processFile($fn):

    sub processFile {
        my $fn = shift;
        # open the (gzipped) file through a pipe as INPUT
        my %hash = ();
        while ( <INPUT> ) {
            # populate %hash
        }
        # iterate over the keys, outputting data
    }


What's happening is that the script doesn't appear to be flushing memory on every call of the processFile, so pretty quickly it's using all the RAM and slows down to a crawl. That's my guess, anyway -- all I can see is memory usage growing uniformly. Is Perl not garbage collecting the freed hashes? Is there anything one can do about it? Thanks for any help.

Replies are listed 'Best First'.
Re: Running out of resources while data munging
by ikegami (Patriarch) on Jun 24, 2008 at 03:50 UTC

    The hash will be emptied when processFile exits, unless you do something that prevents it from being freed (such as creating a circular reference or returning a reference to the hash).

    The memory will be returned to Perl to reuse, but not necessarily to the OS. That means you should expect the memory usage to remain constant.
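    To illustrate the circular-reference case: in the sketch below (the `Counted` package name is made up for demonstration), the hash that points at itself is never freed when the sub exits, while the one whose inner link is weakened with Scalar::Util::weaken is destroyed immediately:

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken);

my $destroyed = 0;

package Counted;
sub new     { return bless {}, shift }
sub DESTROY { $destroyed++ }

package main;

sub leaky {
    my $h = Counted->new;
    $h->{self} = $h;        # circular reference: refcount never drops to zero
}

sub fixed {
    my $h = Counted->new;
    $h->{self} = $h;
    weaken($h->{self});     # weak link does not count toward the refcount
}

leaky();
print "after leaky: $destroyed\n";   # prints 0 -- the hash is still alive
fixed();
print "after fixed: $destroyed\n";   # prints 1 -- freed when $h left scope
```

    The leaked hash is only reclaimed at global destruction, which is why a sub that builds such a structure on every call appears to grow without bound.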

Re: Running out of resources while data munging
by pc88mxer (Vicar) on Jun 24, 2008 at 04:24 UTC
    I've been processing ~200 meg files of ASCII records by reading them into a hash of array references, and then writing each @{ $hash{$key} } record into its own file.
    It sounds like you are just binning the lines based on some criteria. Can you output the lines directly without first collecting them in the hash? Something like:
    my %FH;
    while (<INPUT>) {
        my $key = ...;   # determine key from $_
        my $fh  = $FH{$key} ||= open_fh_for_key($key);
        print $fh $_;
    }
    for my $fh (values %FH) { close $fh }  # ... or just wait for %FH to go out of scope
    If this approach would work but you would open too many files, there's work-arounds for that situation.
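    One such work-around, sketched below with made-up names (`$MAX_OPEN`, `fh_for_key`, the `out.$key` file naming), is to close all handles when a cap is reached and reopen in append mode, so an evicted file simply picks up where it left off:

```perl
use strict;
use warnings;

my $MAX_OPEN = 100;   # cap on simultaneously open handles (tune to ulimit -n)
my %FH;

# Return a (possibly cached) output handle for a key, evicting the
# whole cache when the cap is hit.
sub fh_for_key {
    my ($key) = @_;
    return $FH{$key} if $FH{$key};
    if (keys %FH >= $MAX_OPEN) {
        close $_ for values %FH;
        %FH = ();
    }
    open my $fh, '>>', "out.$key"   # '>>' so earlier output is preserved
        or die "open out.$key: $!";
    return $FH{$key} = $fh;
}
```

    A smarter variant evicts least-recently-used handles one at a time, but wholesale eviction is usually fast enough since the cost is just a reopen per evicted key.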
Re: Running out of resources while data munging
by CountZero (Bishop) on Jun 24, 2008 at 04:46 UTC
    I think you are only showing us part of the program and have left out the most important part: how does your processFile subroutine tell the calling code what is in %hash? Do you perchance return a reference to %hash? That would account for the increase in memory.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: Running out of resources while data munging
by roboticus (Chancellor) on Jun 24, 2008 at 12:42 UTC
    dnquark:

    I'm not answering your question in this node (others have already done so), just offering a tip I find useful in that situation:

    I frequently rip through huge files to collect statistics, split records among multiple outputs, etc. What I often find useful is to run the data through sort first. If the task suggests a good sort key, this can remove the need to collect all the information into a hash first. It usually doesn't cost much time, either: by the time I rip through the sorted file, the OS cache often still holds the file image in RAM.
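    A sketch of the idea (the `bin_sorted` name and callback interface are made up; it assumes comma-separated lines with the key in the first field): once the input is sorted on the key, all lines for a key are adjacent, so one pass with a single open output handle suffices and no hash is needed.

```perl
use strict;
use warnings;

# Stream a key-sorted input, switching the output handle whenever
# the key changes. $out_for is a callback returning a handle per key.
sub bin_sorted {
    my ($in, $out_for) = @_;
    my ($prev, $fh);
    while (my $line = <$in>) {
        my ($key) = split /,/, $line, 2;
        if (!defined $prev or $key ne $prev) {
            close $fh if $fh;
            $fh   = $out_for->($key);
            $prev = $key;
        }
        print $fh $line;
    }
    close $fh if $fh;
}
```

    The sort itself can be done outside Perl, e.g. `sort -t, -k1,1 input > sorted`, or straight off the pipe with something like `open my $in, '-|', 'gzip -dc file.gz | sort -t, -k1,1'`.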

    ...roboticus
Re: Running out of resources while data munging
by Fletch (Bishop) on Jun 24, 2008 at 12:43 UTC

    You might also look into something like BerkeleyDB and keep the collated results in a hash-on-disk instead of in RAM. There'll be a bit more overhead but it might be the trick to go from "MOMMY MOMMY MAKE THE THRASHING STOP" to "working acceptably fast".
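    The hash-on-disk idea is just a tied hash. A minimal sketch using the always-core SDBM_File as a stand-in (the `/tmp/collate` path is made up; SDBM has a small per-record size limit, so for real data you would tie with BerkeleyDB or DB_File instead, with essentially the same code):

```perl
use strict;
use warnings;
use Fcntl qw(O_RDWR O_CREAT);
use SDBM_File;

# Tie %hash to a disk file: stores and fetches go to disk, not RAM.
tie my %hash, 'SDBM_File', '/tmp/collate', O_RDWR | O_CREAT, 0644
    or die "tie: $!";

$hash{some_key} = "collated record";   # written to disk
print $hash{some_key}, "\n";
untie %hash;
```

    Note that only plain string values round-trip through a DBM tie like this; nested structures need DBM::Deep or MLDBM on top.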

    Also don't discount just getting more RAM. It's probably going to be much cheaper and perform better than you spending significant time working around what can be solved by around $100-200 worth of DIMM or SIMM or whatever they're calling them these days.

    The cake is a lie.

Re: Running out of resources while data munging
by dragonchild (Archbishop) on Jun 24, 2008 at 13:59 UTC
    Maybe you should rethink the whole plan and just use DBM::Deep. This is a textbook use case for it.

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?