in reply to Re^3: Attempt to free temp prematurely and unreferenced scalar
in thread Attempt to free temp prematurely and unreferenced scalar

Dear BrowserUk,

After Read-only, I also checked the input file sizes. They are still within a reasonable range (5k-60k), and I have 1GB of RAM.

I've also tried checking the sizes of some potentially large variables using Devel::Size. All of them are still reasonable (around 20MB).
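For reference, the check was along these lines; the variable names here are placeholders for my actual structures:

    use Devel::Size qw( total_size );

    # Report the total bytes held by a few suspect structures.
    printf "input_seqs: %d bytes\n", total_size( \@input_seqs );
    printf "lmers:      %d bytes\n", total_size( \%lmers );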

Apart from my question on your posting: is there a way to check which part of my code is producing the "Out Of Memory" message, as stated in my OP?
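(One crude way to localize this on Linux would be to print the process's resident set size at checkpoints; the helper below is a hypothetical sketch, not part of my code:)

    # Linux-only: report VmRSS from /proc/self/status at key points.
    sub report_mem {
        my ($label) = @_;
        open my $fh, '<', '/proc/self/status' or return;
        while ( my $line = <$fh> ) {
            print "$label: $line" if $line =~ /^VmRSS/;
        }
    }

    report_mem( 'after loading sequences' );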


---
neversaint and everlastingly indebted.......

Re^5: Attempt to free temp prematurely and unreferenced scalar
by BrowserUk (Patriarch) on Feb 22, 2006 at 11:08 UTC

    Honestly, without seeing the code, anything would be (another) guess. What are you doing with 60k of input data to create even one 20MB data structure?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Dear BrowserUk,
      Basically, what my code does is take sets of DNA input sequences and find conserved substrings within them.

      The variables expand because, for each length-W string from the input sequence, I again collect all of that string's substrings, as sketched below.
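      Schematically, the expansion looks something like this ($seq, $W, and the loop bounds are illustrative, not my real code):

          # For every length-W window, every shorter substring of that
          # window is collected as well, so storage grows roughly as
          # O(W**2) strings per window rather than O(1).
          for my $i ( 0 .. length($seq) - $W ) {
              my $window = substr $seq, $i, $W;
              for my $len ( 1 .. $W ) {
                  push @collected, substr $window, $_, $len
                      for 0 .. $W - $len;
              }
          }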

      So I run the "main_process" subroutine multiple times, once for each parameter set (generated with the gen_param subroutine).

      Don't be overwhelmed by my code below; you can ignore much of it. The out-of-memory message only occurs after it completes the first set of parameters, and then it breaks. See the last portion of the "main_process" subroutine.

      Really hope to hear from you again.

      ---
      neversaint and everlastingly indebted.......

        Phew! Where to start :) I can't tell you exactly what is blowing your memory; without sp_neg_eff and suitable data files, it's impossible for me to run it.

        Overall, there are several routines I don't have sight of, lots of loops, and lots of arrays. It is very difficult to work out where the problem lies simply by inspection, and nothing has leapt off the page at me as the obvious cause. It may simply be that the cumulative effect of all those loops, arrays, and hashes finally consumes your RAM completely.

        However, I can point out some things that are certainly not helping either your memory consumption or your processing time. This may come in dribs and drabs as I wrap my brain around your code; I'll keep it all in the one post and /msg you if I update it.

        Example 1. Your getSeqfromfasta2lmers routine is doing far more work, and using three times as much memory, as is necessary.

        sub getSeqfromfasta2lmers {
            my $file = shift;
            my @seqs = ();
            open INFILE, "<$file" or die "$0: Can't open file $file: $!";
            my $in = Bio::SeqIO->new(
                -format  => 'fasta',
                -noclose => 1,
                -fh      => \*INFILE,
            );
            while ( my $seq = $in->next_seq() ) {
                push @seqs, $seq->seq();
            } # end while
            return @seqs;
        }

        You are calling this routine once outside your main loop and then again inside it each time around the loop.

        1. The first time, you are calling it just to get the number of sequences in the file:
          my $nofseq = scalar( getSeqfromfasta2lmers( $file ) );
        2. Then you call it in the loop to load the sequences into an array:
          my @input_seqs = getSeqfromfasta2lmers($file);
        3. Which you immediately follow by assigning the number of sequences to another local variable:
          my $ip = @input_seqs;
        4. And you repeat this last step every time around that main loop.

        As the name of the file never changes, you are re-reading the same file many times. And as far as I can tell, you never modify the contents of the array, so this is just a waste of cycles.

        As constructed above, the routine reads the sequences one at a time and pushes them onto a local array. You then return this array as a list to the caller, where it is assigned to another array or, in the case of the first call, simply counted and discarded.
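        As an aside, a sketch of another way to avoid that copy is to return a reference instead of a list; this is a variation for comparison, not the newFH fix below:

            sub getSeqfromfasta2lmers {
                my $file = shift;
                my @seqs;
                open my $fh, "<$file" or die "$0: Can't open file $file: $!";
                my $in = Bio::SeqIO->new( -format => 'fasta', -fh => $fh );
                while ( my $seq = $in->next_seq() ) {
                    push @seqs, $seq->seq();
                }
                return \@seqs;    # one array, nothing copied at the call site
            }

            my $seqs   = getSeqfromfasta2lmers( $file );
            my $nofseq = @$seqs;  # count without flattening to a list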

        A quick look at the docs for Bio::SeqIO shows that it has a method specially designed for reading all the sequences from a fasta file, namely ->newFH(). I don't get the logic behind the name, but the use is simple. The following is an (almost) drop-in replacement for your version. It will require a couple of other changes in your code, but those are best changed anyway.

        sub getSeqfromfasta2lmers {
            my $file = shift;
            ## Use a lexical file handle so that the file is closed automatically
            open my $fh, "<$file" or die "$0: Can't open file $file: $!";
            my $in = Bio::SeqIO->newFH( -format => 'fasta', -fh => $fh );
            return <$in>;    ### MUST BE CALLED IN A LIST CONTEXT!!!!
        }

        Then call the routine ONCE at the top of the program, assign the sequences into an array, and re-use that array each time around the main loop.

        ## Get the sequences.
        my @input_seqs = getSeqfromfasta2lmers( $file );

        ## And a count of them.
        my $nofseq = @input_seqs;

        And delete the following two lines from the top of the main sub

        sub main_process {
            ....

            ## Just reusing the array from outside the loop/sub will save a lot
            ## of time and some space.
            ## DELETE my @input_seqs = getSeqfromfasta2lmers($file);

            ## This doesn't appear to be used anywhere, but if I missed it and
            ## it is, replace references to $ip with $nofseq.
            ## DELETE my $ip = @input_seqs;

        What difference, if any, these changes will make to your overall problem I'm not sure, but they will do no harm. It's not very PC to pass data to subs via closure this way, but it avoids messing with references, and you are already passing (too) many arguments to that sub as it is.
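        (If you did want to pass the data explicitly rather than via closure, an array reference is nearly free to pass; a minimal sketch, with the sub's many other parameters elided:)

            ## Passing a reference copies one scalar, not the whole list.
            main_process( \@input_seqs );    # plus the other args as before

            sub main_process {
                my ( $seqs, @other_args ) = @_;
                for my $seq ( @{$seqs} ) {
                    ## ... process each sequence ...
                }
            }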

        Example 2. Equally, your routine

        sub getlmersfromseq {
            my ( $seqsarr, $l ) = @_;
            my @lmers;
            @lmers = map { substrings $_, $l } @{$seqsarr};
            my @uniq_lmers = uniq @lmers;
            return @uniq_lmers;
        }

        Could be recoded as

        sub getlmersfromseq {
            my ( $seqsarr, $l ) = @_;
            return uniq map { substrings $_, $l } @{$seqsarr};
        }

        I've no idea how big those intermediate arrays get, but they are not helping you in any way.

        Example 3. This won't affect your memory, but it made it easier for me to work out how many times the main loop/sub iterates. You don't need to assign a list to an array in order to use it in a foreach loop. Now it is obvious that twelve anonymous hashes are being returned by the sub:

        sub gen_param {
            my ( $file, $file_neg, $nofseq ) = @_;
            my @param_groups;
            foreach my $wlen ( 8, 15, 20 ) {
                foreach my $fract ( 0.8, 0.5 ) {
                    foreach my $q ( $nofseq, $nofseq * 1.5 ) {
                        push @param_groups, {
                            file              => $file,
                            file_neg          => $file_neg,
                            submt_len         => 5,
                            submt_d           => 1,
                            e                 => 0,
                            W_size            => $wlen,
                            lp                => $fract * $wlen,
                            support_threshold => $q,
                            min_inst_lower    => $q,
                            min_inst_upper    => ( 3 * $q ),
                            polyTA_lim        => 0.8,
                            poly_lim          => 0.8,
                        };
                    }
                }
            }
            return @param_groups;
        }
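        The calling loop can then iterate over the return list directly; the loop body here is schematic:

            for my $param ( gen_param( $file, $file_neg, $nofseq ) ) {
                ## ... twelve passes, one per anonymous hash ...
            }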

        There are also a couple of places where you are assigning array references to hash elements like this:

        $hash{ 'some complicated key' } = [@matches];

        In both cases the arrays are local, and you will save some space by simply doing

        $hash{ 'some complicated key' } = \@matches;
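        A note on why the bare reference is safe here: as long as @matches is a fresh lexical each time around the loop, every hash element gets its own array. A schematic sketch, with the loop and pattern invented for illustration:

            while ( my $seq = shift @queue ) {
                my @matches = $seq =~ /$pattern/g;    # new array every pass
                $hash{$seq} = \@matches;              # no copy, no sharing
            }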

        Example 4. Then there are oddities, like the way you are accumulating the returns from the main sub in a hash:

        my $output = main_process( @{$_}{ qw/
            file file_neg submt_len submt_d e W_size lp support_threshold
            min_inst_lower min_inst_upper polyTA_lim poly_lim
        / } );

        $result{ 'ParamGroup' . $count++ } = $output;

        But the main routine doesn't return anything?

            } # ----- end foreach $mcands -----
            return;
        }
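        If the caller really does expect a result, the sub needs to return one; a hypothetical fix, with the variable name invented:

            } # ----- end foreach $mcands -----
            return \%results;    # whatever main_process actually computed
        }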

        It will probably be intensely frustrating for you to hear that you are going to have to simplify your code before you will be able to track down the cause of your problem.

        I doubt these changes individually or collectively will have any great effect on your memory consumption, but they may help you clean up the code enough to let you see where the real problem lies.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.