bernanke01 has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I'm a bit stuck trying to output a series of very large hashes. Using what I thought would be memory-efficient code:

open(AGG, '>', $aggregate_file);
while ( my ($key1, $val1) = each %noninteracting_hash ) {
    while ( my ($key2, $val2) = each %$val1 ) {
        print AGG join( "\t",
            $key1, $key2,
            $noninteracting_hash{$key1}{$key2},
            $interacting_hash{$key1}{$key2},
            $literature_hash{$key1}{$key2},
            $predicted_hash{$key1}{$key2} ), "\n";
    }
}
close(AGG);

When I say huge hashes, what I mean is that each hash is keyed by 10-char strings and each entire HoH contains 3,895,529,794 elements (i.e. the output file should contain roughly four billion rows). The program reproducibly crashes after about 6.8 million rows; the crash point varies by only about 30 rows from run to run.

The specific error is my favourite: Out of memory!

Can anyone see what might be going on here? I'm kinda lost!

Replies are listed 'Best First'.
Re: Outputting Huge Hashes
by brian_d_foy (Abbot) on Jan 29, 2006 at 18:47 UTC

    What are those other hashes? As others suspect, they may be the problem. You could use Devel::Size's total_size function to watch their growth. Do you get the right results in the output file? A lot of blank columns would indicate that you're making new entries in those other hashes.

    You could use tied hashes instead. Something such as DBM::Deep can store the data on disk instead of in memory so you don't suck up all of your RAM. You could also shove all of this into a real database server and get more fancy with the problem. :)
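    For what it's worth, here is a minimal sketch of both ideas; the file name, keys, and hash name below are just placeholders, not anything from your program:

        use strict;
        use warnings;
        use Devel::Size qw(total_size);
        use DBM::Deep;

        # Report how much memory a structure is using (call this periodically
        # while the hashes are being built, or just before the output loop):
        # printf "interacting: %d bytes\n", total_size( \%interacting_hash );

        # Keep the structure on disk instead of in RAM; DBM::Deep gives you
        # something that behaves like an ordinary nested HoH.
        my $db = DBM::Deep->new( 'aggregate.db' );
        $db->{ABCDEFGHIJ}{KLMNOPQRST} = 1;           # nested keys work as usual
        print $db->{ABCDEFGHIJ}{KLMNOPQRST}, "\n";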

    Good luck!

    --
    brian d foy <brian@stonehenge.com>
    Subscribe to The Perl Review

      Yup, I actually started off using DBM::Deep, but realized that my dataset could squeeze into memory and so switched away from that. The DBM::Deep implementation takes quite a bit longer to run than the in-memory version (30 days vs. 12 hours based on two trials).

      Also, the output (until the program crashes, at least) looks perfectly normal. No blank or duplicate rows appear; it's just that at some point the system runs out of memory during output.

Re: Outputting Huge Hashes
by ysth (Canon) on Jan 29, 2006 at 18:34 UTC
    Are there really already entries in %interacting_hash, %literature_hash, and %predicted_hash for each entry in %noninteracting_hash? If not, you will be creating them (autovivifying empty entries) just by looking them up in the print statement.

    If that's not it, what version of perl are you using? Can you try a newer version, in case there was a fixed leak?
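    A tiny self-contained example of the effect, in case it helps (nothing here is from the original program):

        use strict;
        use warnings;

        my %h;

        # Merely *reading* a nested entry autovivifies the intermediate hash:
        my $x = $h{foo}{bar};
        print scalar keys %h, "\n";        # prints 1 -- $h{foo} now exists

        # Guarding the lookup with exists avoids creating anything new:
        my $y = exists $h{baz} ? $h{baz}{qux} : undef;
        print scalar keys %h, "\n";        # still 1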

      Indeed there were supposed to be, and testing indicates that there are. I'm using perl 5.8.7.
Re: Outputting Huge Hashes
by lima1 (Curate) on Jan 29, 2006 at 18:00 UTC
    I assume that the three other hashes grow because of autovivification?
      When you first said this, I thought "of course!" and went to track it down, but... that's not what's happening at all! I've found a simple test case that I'll post in another node, because I think it might be of general interest.
Re: Outputting Huge Hashes
by TedPride (Priest) on Jan 30, 2006 at 02:09 UTC
    Some more information about the underlying problem might be good. Why do you need a set of hashes with nearly 4 billion elements? What sort of data is it, and how are you trying to manipulate the data? Perl is not exactly the most memory-efficient language in the world, and working with smaller chunks of data would be smart if at all possible. For instance, if you know that your keys are going to be random 10-char strings, you could save all record data in files named according to the first one or two characters of each key, process the files one by one, then merge the results into a master file.
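    A rough sketch of that bucketing idea, assuming tab-separated records keyed by the 10-char string (the file names and helper below are invented for illustration):

        use strict;
        use warnings;

        # Append a record line to a file chosen by the first character of its key.
        # Opening and closing per record keeps the sketch simple; a small cache of
        # filehandles would be faster in practice.
        sub bucket_record {
            my ( $key, $line ) = @_;
            my $bucket = substr( $key, 0, 1 );       # use (0, 2) for a finer split
            open my $fh, '>>', "bucket_$bucket.tsv" or die "bucket_$bucket.tsv: $!";
            print {$fh} $line, "\n";
            close $fh;
        }

        # e.g. bucket_record( 'ABCDEFGHIJ', join "\t", 'ABCDEFGHIJ', 'KLMNOPQRST', 1, 0, 0, 1 );

    Each bucket file can then be processed on its own and the per-bucket outputs concatenated into the master file, so only a small fraction of the data ever has to sit in memory at once.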