Juba has asked for the wisdom of the Perl Monks concerning the following question:

I have adopted some of the suggestions given before, but my code still runs out of memory when I use a very large infile. Can somebody please help me optimize it? Thanks a lot.
#!/usr/bin/perl -w
# This code joins randomly haplotypes two by two from the output given
# by Eli's program with an arbitrary number of populations. It needs a
# 2 allele output. It is the infile for R calculation.

if ( @ARGV != 1 ) {
    print "incorrect usage ---- TYPE IN COMMAND LINE: perl eli.pl infile\n";
    exit();
}

print "What's the OUTFILE?\n";
$OUTFILE = <STDIN>;
open (OUT, ">$OUTFILE") or die "could not create $OUTFILE\n";
open (IN, $ARGV[0]);    # open the first argument

# transform the input list in an array of arrays
@total      = ();
@haplotypes = ();
$currentPop = 0;
$sampleSize = {};       # This will hold the sample size for each population

while (<IN>) {
    chomp;
    if (/^(\-{0,1}\d+\t-{0,1}\d+)/) {
        $sampleSize{$currentPop}++;
        @temp = split;
        push @{$haplotypes{$currentPop}}, [@temp];
    }
    elsif (/segsites: (\d+)/) {
        @{$TempList{$currentPop}} = ();
        while (@{$haplotypes{$currentPop}}) {
            push(@{$TempList{$currentPop}},
                 splice(@{$haplotypes{$currentPop}},
                        rand(@{$haplotypes{$currentPop}}), 1));
        }
        @{$haplotypes{$currentPop}} = @{$TempList{$currentPop}};
        push @total, $currentPop;
        $currentPop++;
    }
}

foreach $_ (@total) {
    print "Population $_\n";
    print OUT "Population $_\n";
    print "$sampleSize{$_}\n";
    for ($i = 0; $i < $sampleSize{$_}/2; $i++) {
        @pair = ();
        @pair = splice @{$haplotypes{$_}}, 0, 2;
        print "$pair[0][0]\t$pair[1][0]\t$pair[0][1]\t$pair[1][1]\t\n";
        print OUT "$pair[0][0]\t$pair[1][0]\t$pair[0][1]\t$pair[1][1]\t\n";
    }
}
print "$currentPop\n";
The infile looks like this, with hundreds of thousands of repeats:

segsites: 2 tMRCA: 0.6398829 Fsense 1 founders: 2
-4	0
-4	0
-1	1
-4	0
-
segsites: 2 tMRCA: 0.3395337 Fsense 1 founders: 2
0	0
0	0
-3	-1
0	0

Re: still memory leak
by BrowserUk (Patriarch) on Aug 05, 2004 at 02:12 UTC

    Your program is not leaking memory, it is consuming it! By design. Your design.

    You've added -w, but you still haven't used strict, and that is concealing one of the reasons you are using a lot of memory.

    At least way more than you need to.

    @haplotypes = (); ## THIS IS AN ARRAY DECLARATION

    .....

    push @{ $haplotypes{ $currentPop} }, [ @temp ];
    ## THIS ...........^............^...................is a hash
    ## with a numeric key, that increments from zero, by 1!

    Assuming your figure of "hundreds of thousands of repeats" means 100,000, then simply changing

    $haplotypes{ $currentPop} to $haplotypes[ $currentPop ]

    in five places in your code will save you 11 MB. If it means 500,000, then 55 MB saved.
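
    If you want to measure the difference for yourself, something along these lines would do it. This is only a sketch: it assumes the non-core Devel::Size module from CPAN is installed, and the absolute numbers will vary with your perl build.

    use strict;
    use warnings;
    use Devel::Size qw[ total_size ];   # assumption: Devel::Size is installed from CPAN

    my ( %as_hash, @as_array );
    for my $pop ( 0 .. 99_999 ) {       # 100,000 "populations", as above
        $as_hash{$pop}  = 1;            # numeric keys 0, 1, 2, ... in a hash
        $as_array[$pop] = 1;            # the same values in an array
    }

    printf "hash:  %d bytes\n", total_size( \%as_hash );
    printf "array: %d bytes\n", total_size( \@as_array );

    The absolute sizes won't match the estimates above, but the per-key overhead of the hash versus the array shows up clearly.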

    This:

    @temp = split;
    push @{$haplotypes{$currentPop}}, [@temp];

    would be better written as

    push @{ $haplotypes[ $currentPop ] }, [ split ];

    It may also save some memory through reduced fragmentation. And this:

    @{$TempList{$currentPop}} = ();
    while (@{$haplotypes{$currentPop}}) {
        push @{$TempList{$currentPop}},
            splice(@{$haplotypes{$currentPop}}, rand @{$haplotypes{$currentPop}}, 1);
    }
    @{$haplotypes{$currentPop}} = @{$TempList{$currentPop}};

    Re-written as:

    use List::Util qw[ shuffle ];    ## near the top of your program

    ...

    @{ $haplotypes{ $currentPop } } = shuffle @{ $haplotypes{ $currentPop } };

    may save some more for the same reason. Either way, it's clearer.
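
    For reference, here is a minimal standalone demonstration of shuffle; the haplotype rows are invented, but the shape matches what your reading loop builds:

    use strict;
    use warnings;
    use List::Util qw[ shuffle ];

    # Each element is an arrayref of two alleles, like the rows read from the infile.
    my @haplotypes = ( [ -4, 0 ], [ -4, 0 ], [ -1, 1 ], [ -4, 0 ] );

    # shuffle returns the same elements in a random order, replacing the
    # push/splice/rand loop in the original code.
    @haplotypes = shuffle @haplotypes;

    print "@$_\n" for @haplotypes;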

    And finally, you are carefully squirreling away the current population number ($currentPop) in @total:

    push @total, $currentPop;

    Which you then use to index your %sampleSize hash. But $currentPop is simply a counter that gets incremented from 0.

    So: foreach $_ ( @total ) {

    Is identical to: foreach $_ ( 0 .. $currentPop - 1 ) {

    except that it doesn't create a 10 MB array in the process.
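
    In other words, something like this (a sketch using your variable names; the push @total line, and @total itself, can then go away entirely):

    foreach my $pop ( 0 .. $currentPop - 1 ) {   # replaces: foreach $_ ( @total )
        print     "Population $pop\n";
        print OUT "Population $pop\n";
        # ... the rest of the pairing loop is unchanged, with $_ replaced by $pop
    }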

    But this gets worse.

    In your final for loop, you use $_ to index into $sampleSize{ $_ }.

    And %sampleSize is a hash (despite the fact that you declare a hashref with the same name at the top of your program) that is indexed by $currentPop.

    Now, the value of $currentPop that is being used as the key in %sampleSize will not be contiguous, making %sampleSize a "sparse array" and a hash a good choice. But...

    You are pushing sequential integer values ($currentPop) into an array (@total), which you then use to iterate the sparse (non-sequential) keys of %sampleSize. Which means that in one in four iterations (more or less, depending on how representative your sample data is) of the inner for loop that is doing the printing, the controlling maximum value

    for($i = 0; $i < $sampleSize{$_}/2; $i++) {

    Is going to be an autovivified undef divided by 2! Which, with -w in force (assuming you didn't just tack that on for posting), should be causing reams of

    Use of uninitialized value in division (/)...
    warnings to be printed.
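
    You can reproduce that warning in isolation with a few lines; this is only a sketch, with invented numbers, but the missing key is the point:

    use strict;
    use warnings;

    my %sampleSize = ( 1 => 8 );             # population 0 never got a count
    for my $pop ( 0, 1 ) {
        my $limit = $sampleSize{$pop} / 2;   # $pop == 0 warns: uninitialized value in division
        print "pop $pop: limit $limit\n";
    }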

    That you are not seeing those warnings leads me to believe that you are not telling the whole story, and disinclines me to further analysis or help.

    Try making your code compile and run under warnings and strict and then come back and you might get some more help.
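
    As a starting point, the strict-friendly top of the program might look something like this. It is only a sketch: your names are kept, but the chomp, the three-argument opens and the error messages are my additions.

    #!/usr/bin/perl
    use strict;
    use warnings;

    die "usage: perl eli.pl infile\n" unless @ARGV == 1;

    print "What's the OUTFILE?\n";
    chomp( my $outfile = <STDIN> );

    open my $out, '>', $outfile  or die "could not create $outfile: $!\n";
    open my $in,  '<', $ARGV[0]  or die "could not open $ARGV[0]: $!\n";

    my @haplotypes;        # per-population arrays of haplotype rows (an array, not a hash)
    my %sampleSize;        # sample size per population (a plain hash, not a hashref)
    my $currentPop = 0;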


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
      Well, thanks for all the tips. I have already compiled the code and it works for, say, 50,000 repeats without any error messages. I am really telling the whole story. I'll try to incorporate all your suggestions. Thanks again.