Howdy again,

I've hit a memory-related conundrum, and am looking for a bit of advice. For background, this is broadly the same project that I asked about previously, first doing benchmarks to determine the fastest way to find motifs in a sequence, then getting help from BrowserUK on adding threading. Basically, the goal is to search complete genomic sequences and find all occurrences of a number of short motifs.

As the project has evolved, I found the need to allow for ambiguity in my match motifs. Thus the code I had working for exact matches needed to change. I'll give the entire script at the bottom, but here is the relevant function for exact matches:

So the key analysis code is this:

while ( ($pos = index($sequence, $motif, $pos)) >= 0 ) { lock $semaphore; print $fh join("\t", $genome_build, @row, $chr, $pos+1), "\n"; $pos += $len; }

When I go ahead and try to replace that with a regular-expression based parser, I run out of memory. The code I'm replacing with is just:

while ( $sequence =~ /$motif/g ) { lock $semaphore; print $fh join("\t", $genome_build, @row, $chr, pos($sequence) +), "\n"; }

I should specify: when I say "run out of memory" I mean I literally get an Out of memory! message from the Perl interpreter as my script terminates prematurally. I'm very confused why the memory usage would be very different between these two. Any suggestions?

Here's the full output on termination:

Out of memory! Callback called exit at find_and_load_motifs_test.pl line 150. Thread 3 terminated abnormally: Callback called exit at find_and_load_ +motifs_test.pl line 150. Perl exited with active threads: 9 running and unjoined 1 finished and unjoined 0 running and detached

Here's the full code of the script, in case it helps:

### INCLUDES ######################################################### +################################ use strict; use Bio::SeqIO; use Carp; use threads; use threads::shared; # turn off STDOUT buffering to allow smooth printing from multiple thr +eads $| = 1; ### PARAMETERS ####################################################### +################################ my $chr_file = $ARGV[0]; my $motif_file = $ARGV[1]; my $genome_build = $ARGV[2]; my $max_threads = 10; if ( 3 != scalar(@ARGV) ) { croak 'Invalid parameter number'; } elsif ( ! -e $chr_file || ! -T $chr_file ) { croak 'Missing or invalid chromsome-listing file'; } elsif ( ! -e $motif_file || ! -T $motif_file ) { croak 'Missing or invalid motif-list file'; } my $out_file = 'results.out'; ### LOCALS ########################################################### +################################ my @chromosomes; my %threads; my $semaphore : shared; ### LOAD THE CHROMOSOME LIST ######################################### +################################ open(my $fh_chr, '<', $chr_file) or croak "Unable to open chromsome li +st: $chr_file"; while (<$fh_chr>) { s/^\s+//; s/\s+$//; my $row = $_; next() if (!$row); push @chromosomes, $row; } close($fh_chr); ### FUNCTION FOR THREADING ########################################### +################################ sub threading_function($$$) { my $chromosome = $_[0]; my $motif_file = $_[1]; my $fh = $_[2]; # set up paths my $chr = 'chr'.$chromosome; my $directory = $chromosome.'/'; my $fa_file = $chr.'.fa.masked'; my $gz_file = $fa_file.'.gz'; my $fa_path = $directory.$fa_file; my $gz_path = $directory.$gz_file; # decompress sequence file my $tst = `gunzip $gz_path`; # read sequence for motif-scanning my $seqio = Bio::SeqIO->new( -file => "<$fa_path", -format => 'largefasta' ); my $seq = $seqio->next_seq(); my $sequence = $seq->seq(); # open the motif file open(my $fh_motif, '<', $motif_file) or croak "Unable to open +motif file: $motif_file"; # process each motif in turn while (<$fh_motif>) { s/^\s+//; s/\s+$//; my @row = split("\t"); next() if ( 4 != scalar(@row) ); my $motif = pop(@row); while ( $sequence =~ /$motif/g ) { lock $semaphore; print $fh join("\t", $genome_build, @row, $chr +, pos($sequence)), "\n"; } } # close the motif file close($fh_motif); # recompress sequence file $tst = `gzip -9 $fa_path`; } ### THREAD MANAGEMENT ################################################ +################################# # initially all chromosomes are waiting to be run my @waiting = @chromosomes; # open up the output file open(my $fh_out, '>', $out_file) or croak 'Unable to open outfile'; # turn off buffering (make the file-handle "hot") to allow smooth prin +ting from multiple threads select((select($fh_out), $|=1)[0]); # keep going while either chromosomes are waiting or threads are execu +ting while ( scalar(@waiting) || scalar( keys(%threads) ) ) { # Note: I could have used threads->list( threads::running ) to + count the running threads # first check if old threads are finished foreach my $chr ( keys(%threads) ) { # if a thread is finished join it and remove from %thr +eads if ( $threads{$chr}->is_joinable() ) { $threads{$chr}->join(); delete $threads{$chr}; } } # then check if we can start a new thread if ( ( scalar( keys(%threads) ) < $max_threads ) && ( scalar(@ +waiting) ) ) { my $chr = pop(@waiting); $threads{$chr} = threads->new(\&threading_function, $c +hr, $motif_file, $fh_out); } # take a break if we are all full-up on threads to not waste C +PU sleep 60 if ( scalar( keys(%threads) ) >= $max_threads ); }

In reply to Memory Usage in Regex On Large Sequence by bernanke01

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.