asqwerty has asked for the wisdom of the Perl Monks concerning the following question:
Hello Monks
I'm trying to do a simple meta-analysis across a few databases. The format of the DBs is this:
```text
CHR1 CHR2 SNP1      SNP2      OR_INT STAT     P
17   18   rs9912311 rs9965425 0.9307 0.06328  0.8014
17   18   rs9912311 rs9963148 0.9307 0.06328  0.8014
17   18   rs9912311 rs9959874 0.9668 0.01788  0.8936
17   18   rs9912311 rs1893506 1.091  0.07564  0.7833
17   18   rs9912101 rs9965425 0.9003 0.1249   0.7238
17   18   rs9912101 rs9963148 0.9003 0.1249   0.7238
17   18   rs9912101 rs9959874 0.9507 0.0376   0.8462
17   18   rs9912101 rs1893506 1.029  0.007849 0.9294
17   18   rs9905581 rs9965425 0.9003 0.1249   0.7238
```
I have 5 DBs with around 30k lines each, so I wrote these lines:
```perl
use strict;
use warnings;
use File::Slurp qw(read_file);
use Math::CDF qw(qnorm pnorm);
use List::MoreUtils qw(uniq);

my $ofile  = "meta1.txt";
my @ifiles = @ARGV;
my %ipairs;
my @lpairs;

# Load every database into one big hash keyed by file-key and SNP pair.
foreach my $ifile (@ifiles) {
    (my $fk) = $ifile =~ /^(.*)_sets.*/;
    my %ldata = reverse
                map  { /^(.*(rs\d{1,20}\s+rs\d{1,20}).*)$/ }
                grep { /.*rs\d{1,20}\s+rs\d{1,20}.*/ }
                read_file $ifile;
    foreach my $dline (sort keys %ldata) {
        push @lpairs, $dline;
        ( $ipairs{$fk}{$dline}{'head'},
          $ipairs{$fk}{$dline}{'effect'},
          $ipairs{$fk}{$dline}{'pvalue'} )
            = $ldata{$dline} =~ /^(.*)\s+(\d\.\d+)\s+\d\.\d+\s+(\d\.\d+)$/;
    }
}
@lpairs = uniq @lpairs;

open OF, ">$ofile";
my $head = "CHR1 CHR2 SNP1 SNP2 P N";
print OF "$head\n";

# Combine each pair's p-values across the databases and write the result.
foreach my $pair (@lpairs) {
    my $n = 0;
    my $z = 0;
    my $hl;
    my $pvalue = 0;
    my $fk;
    foreach $fk (%ipairs) {
        if ($ipairs{$fk}{$pair}{'pvalue'}) {
            unless ($hl) { $hl = $ipairs{$fk}{$pair}{'head'}; }
            $n++;
            $z += qnorm($ipairs{$fk}{$pair}{'pvalue'});
        }
    }
    if ($n > 2) {
        $z      = $z / sqrt($n);
        $pvalue = pnorm($z);
    }
    if ($pvalue) {
        #printf "$pair -> %.4f\n", $pvalue;
        printf OF "$hl %.4f $n\n", $pvalue;
    }
}
close OF;
```
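For reference, the final loop combines each pair's p-values across databases in a Stouffer-like way: it sums qnorm(p) over the databases that contain the pair and rescales by sqrt(n). A minimal standalone sketch of just that step, using made-up p-values:

```perl
use strict;
use warnings;
use Math::CDF qw(qnorm pnorm);

# Made-up p-values for one SNP pair seen in three databases.
my @pvalues = (0.8014, 0.7238, 0.8936);

# Same combination as in the script above:
# Z = sum( qnorm(p_i) ) / sqrt(n), combined p = pnorm(Z).
my $n = @pvalues;
my $z = 0;
$z += qnorm($_) for @pvalues;
my $combined = pnorm($z / sqrt($n));

printf "combined p = %.4f (n = %d)\n", $combined, $n;
```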
The program actually works fine. However, my problem is that it incrementally consumes memory until it reaches the 32 MB limit; the system then kills the job on its own, so my program never finishes.
So, I have two questions.
Why is this happening? The heavy memory consumption only begins after all the info has already been loaded into the hash, in other words, in the loop where the calculations take place and the results are written to disk.
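One thing that may be relevant here (an assumption, not something the output above proves): looking up a multi-level hash entry in Perl autovivifies the intermediate levels, and `foreach $fk (%ipairs)` iterates over the hash's values (stringified hashrefs) as well as its keys, so each `$ipairs{$fk}{$pair}{'pvalue'}` test can quietly create new entries. A minimal sketch of the effect:

```perl
use strict;
use warnings;

# Hypothetical data shaped like %ipairs: file-key => pair => fields.
my %h = ( db1 => { 'rs1 rs2' => { pvalue => 0.5 } } );

print scalar(keys %h), " top-level key(s) before\n";   # 1

# In list context %h flattens to key/value pairs, so $k is also
# each value (a hashref that stringifies to "HASH(0x...)").
for my $k (%h) {
    # A "read-only" test still autovivifies $h{$k} and $h{$k}{'rs99 rs100'}.
    my $seen = $h{$k}{'rs99 rs100'}{pvalue};
}

print scalar(keys %h), " top-level key(s) after\n";    # now more than 1
```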
Is there any workaround to sort out this problem? I was thinking of writing intermediate results to disk, but I'm not yet sure how to do it.
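A sketch of one possible low-memory layout (under the assumption that each pair appears at most once per file and that only the combined value is needed): keep a single running count and sum of qnorm(p) per pair while reading each file, instead of holding all five databases in a hash of hashes; nothing else has to stay in memory.

```perl
use strict;
use warnings;
use Math::CDF qw(qnorm pnorm);

# Sketch only: assumes the whitespace-separated layout shown above
# (CHR1 CHR2 SNP1 SNP2 OR_INT STAT P) and one line per pair per file.
my %acc;    # "SNP1 SNP2" => { head => ..., n => ..., zsum => ... }

for my $ifile (@ARGV) {
    open my $in, '<', $ifile or die "Cannot open $ifile: $!";
    while (my $line = <$in>) {
        my @f = split ' ', $line;
        next unless @f == 7 && $f[2] =~ /^rs\d+$/ && $f[3] =~ /^rs\d+$/;
        my $pair = "$f[2] $f[3]";
        $acc{$pair}{head} //= "@f[0..3]";       # CHR1 CHR2 SNP1 SNP2
        $acc{$pair}{n}++;
        $acc{$pair}{zsum} += qnorm($f[6]);      # running sum of qnorm(p)
    }
    close $in;
}

open my $out, '>', 'meta1.txt' or die "Cannot write meta1.txt: $!";
print {$out} "CHR1 CHR2 SNP1 SNP2 P N\n";
for my $pair (sort keys %acc) {
    my ($head, $n, $zsum) = @{ $acc{$pair} }{qw(head n zsum)};
    next unless $n > 2;                          # same cutoff as above
    printf {$out} "%s %.4f %d\n", $head, pnorm($zsum / sqrt($n)), $n;
}
close $out;
```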
Replies are listed 'Best First'.
- Re: memory issues by BrowserUk (Patriarch) on Jan 28, 2013 at 09:13 UTC
  - by asqwerty (Acolyte) on Jan 28, 2013 at 09:24 UTC
  - by Anonymous Monk on Jan 28, 2013 at 17:01 UTC
  - by asqwerty (Acolyte) on Jan 28, 2013 at 09:25 UTC
- Re: memory issues by Anonymous Monk on Jan 28, 2013 at 09:08 UTC