bernanke01 has asked for the wisdom of the Perl Monks concerning the following question:

Howdy again,

I've hit a memory-related conundrum, and am looking for a bit of advice. For background, this is broadly the same project that I asked about previously, first doing benchmarks to determine the fastest way to find motifs in a sequence, then getting help from BrowserUK on adding threading. Basically, the goal is to search complete genomic sequences and find all occurrences of a number of short motifs.

As the project has evolved, I found the need to allow for ambiguity in my match motifs. Thus the code I had working for exact matches needed to change. I'll give the entire script at the bottom, but here is the relevant function for exact matches:

So the key analysis code is this:

while ( ($pos = index($sequence, $motif, $pos)) >= 0 ) { lock $semaphore; print $fh join("\t", $genome_build, @row, $chr, $pos+1), "\n"; $pos += $len; }

When I go ahead and try to replace that with a regular-expression based parser, I run out of memory. The code I'm replacing with is just:

while ( $sequence =~ /$motif/g ) { lock $semaphore; print $fh join("\t", $genome_build, @row, $chr, pos($sequence) +), "\n"; }

I should specify: when I say "run out of memory" I mean I literally get an Out of memory! message from the Perl interpreter as my script terminates prematurally. I'm very confused why the memory usage would be very different between these two. Any suggestions?

Here's the full output on termination:

Out of memory! Callback called exit at find_and_load_motifs_test.pl line 150. Thread 3 terminated abnormally: Callback called exit at find_and_load_ +motifs_test.pl line 150. Perl exited with active threads: 9 running and unjoined 1 finished and unjoined 0 running and detached

Here's the full code of the script, in case it helps:

### INCLUDES ######################################################### +################################ use strict; use Bio::SeqIO; use Carp; use threads; use threads::shared; # turn off STDOUT buffering to allow smooth printing from multiple thr +eads $| = 1; ### PARAMETERS ####################################################### +################################ my $chr_file = $ARGV[0]; my $motif_file = $ARGV[1]; my $genome_build = $ARGV[2]; my $max_threads = 10; if ( 3 != scalar(@ARGV) ) { croak 'Invalid parameter number'; } elsif ( ! -e $chr_file || ! -T $chr_file ) { croak 'Missing or invalid chromsome-listing file'; } elsif ( ! -e $motif_file || ! -T $motif_file ) { croak 'Missing or invalid motif-list file'; } my $out_file = 'results.out'; ### LOCALS ########################################################### +################################ my @chromosomes; my %threads; my $semaphore : shared; ### LOAD THE CHROMOSOME LIST ######################################### +################################ open(my $fh_chr, '<', $chr_file) or croak "Unable to open chromsome li +st: $chr_file"; while (<$fh_chr>) { s/^\s+//; s/\s+$//; my $row = $_; next() if (!$row); push @chromosomes, $row; } close($fh_chr); ### FUNCTION FOR THREADING ########################################### +################################ sub threading_function($$$) { my $chromosome = $_[0]; my $motif_file = $_[1]; my $fh = $_[2]; # set up paths my $chr = 'chr'.$chromosome; my $directory = $chromosome.'/'; my $fa_file = $chr.'.fa.masked'; my $gz_file = $fa_file.'.gz'; my $fa_path = $directory.$fa_file; my $gz_path = $directory.$gz_file; # decompress sequence file my $tst = `gunzip $gz_path`; # read sequence for motif-scanning my $seqio = Bio::SeqIO->new( -file => "<$fa_path", -format => 'largefasta' ); my $seq = $seqio->next_seq(); my $sequence = $seq->seq(); # open the motif file open(my $fh_motif, '<', $motif_file) or croak "Unable to open +motif file: $motif_file"; # process each motif in turn while (<$fh_motif>) { s/^\s+//; s/\s+$//; my @row = split("\t"); next() if ( 4 != scalar(@row) ); my $motif = pop(@row); while ( $sequence =~ /$motif/g ) { lock $semaphore; print $fh join("\t", $genome_build, @row, $chr +, pos($sequence)), "\n"; } } # close the motif file close($fh_motif); # recompress sequence file $tst = `gzip -9 $fa_path`; } ### THREAD MANAGEMENT ################################################ +################################# # initially all chromosomes are waiting to be run my @waiting = @chromosomes; # open up the output file open(my $fh_out, '>', $out_file) or croak 'Unable to open outfile'; # turn off buffering (make the file-handle "hot") to allow smooth prin +ting from multiple threads select((select($fh_out), $|=1)[0]); # keep going while either chromosomes are waiting or threads are execu +ting while ( scalar(@waiting) || scalar( keys(%threads) ) ) { # Note: I could have used threads->list( threads::running ) to + count the running threads # first check if old threads are finished foreach my $chr ( keys(%threads) ) { # if a thread is finished join it and remove from %thr +eads if ( $threads{$chr}->is_joinable() ) { $threads{$chr}->join(); delete $threads{$chr}; } } # then check if we can start a new thread if ( ( scalar( keys(%threads) ) < $max_threads ) && ( scalar(@ +waiting) ) ) { my $chr = pop(@waiting); $threads{$chr} = threads->new(\&threading_function, $c +hr, $motif_file, $fh_out); } # take a break if we are all full-up on threads to not waste C +PU sleep 60 if ( scalar( keys(%threads) ) >= $max_threads ); }

Replies are listed 'Best First'.
Re: Memory Usage in Regex On Large Sequence
by liverpole (Monsignor) on Sep 25, 2006 at 16:33 UTC
    Hi bernanke01,

    My guess is that your problem may be related to the number of threads you're using.

    I can't tell from your program how many threads you are starting up (it looks like it's dependent on a data file), but I do know that, in situations where I've tried to create hundreds of threads, I often will get the same Out of memory! error.

    The reason that it's misleading is that it doesn't just depend on the number of threads, it also depends on the context, what each thread is doing, and at which time.  That's why it can easily look like a separate issue.

    My recommendation would be to first try limiting the number of threads (unless it is already quite small), and see if the behavior changes at all.  If that doesn't do it, even for a small quantity (eg. less than 20), then perhaps your problem really does lie elsewhere.


    s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/

      Hi, good question: it's hard-coded at 10 threads, but I've reduced it down to 3 and replicated the error. I'll set it running with just one and see what happens there.

Re: Memory Usage in Regex On Large Sequence
by BrowserUk (Patriarch) on Sep 25, 2006 at 16:50 UTC

    You'll need to dave_the_m or someone similar to explain why the regex engine would consume more memory than index, even for an apparently similar, simple search, but it would not surprise me if it does.

    In the meantime, have you tried reducing my $max_threads = 10;? Did that make a difference?

    Other questions.

    • Which OS are you using?
    • Which version of Perl?
    • If you are on Win32, could you post the output of
      dumpbin /headers \path\to\your\perl.exe

      In particular, these lines:

      1000000 size of stack reserve 1000 size of stack commit 100000 size of heap reserve 1000 size of heap commit

      See Use more threads. for why.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Hi Browser!

      Reducing to $max_threads = 3 didn't affect things, but I'll try going to 1 and see what happens there.

      For other things, this is perl 5.8.8 on AIX 5.2, but all has been replicated on 5.8.7 on AIX 5.3 as well.

Re: Memory Usage in Regex On Large Sequence
by dave_the_m (Monsignor) on Sep 25, 2006 at 17:29 UTC
    Questions:

    How many threads are you creating?
    what sorts of strings does $motif contain? Is it anything other than A/C/G/T x N ? If so, what? Otherwise, what's the maximum length of $motif?

    Dave.

      Hi,

      It's 10 threads, but repeats at 3 as well. For now, the motifs are just 6 strings of GCAT of 5 or 6 letters each. In theory I'd like to use more complex motifs of course. So, here are my current motif-list:

      GCGTG GTGCG CACGC CGCAC CACGTG GTGCAC
        In that case, I can't see any good reason why replacing index(..,"GTGCAC",...) with /GTGCAC/ should use any more memory, except that if $sequence is very large, and if somewhere in the program (eg from an included module), $`, $& or $' is used, then perl has to take a complete copy of $sequence each time, which might put you about the threshold.

        Dave.

Re: Memory Usage in Regex On Large Sequence
by zentara (Cardinal) on Sep 25, 2006 at 17:11 UTC
    Without being able to run your code, I would also check that the Bio::SeqIO objects in each thread are thread-safe. Maybe you could substitute some generic code in it's place and see if the threads run fine with just the file operations.

    I'm not really a human, but I play one on earth. Cogito ergo sum a bum