Memory Usage in Regex On Large Sequence

bernanke01 has asked for the wisdom of the Perl Monks concerning the following question:

Howdy again,

I've hit a memory-related conundrum, and am looking for a bit of advice. For background, this is broadly the same project that I asked about previously, first doing benchmarks to determine the fastest way to find motifs in a sequence, then getting help from BrowserUK on adding threading. Basically, the goal is to search complete genomic sequences and find all occurrences of a number of short motifs.

As the project has evolved, I found the need to allow for ambiguity in my match motifs. Thus the code I had working for exact matches needed to change. I'll give the entire script at the bottom, but here is the relevant function for exact matches:

So the key analysis code is this:

while ( ($pos = index($sequence, $motif, $pos)) >= 0 ) {
    lock $semaphore;
    print $fh join("\t", $genome_build, @row, $chr, $pos+1), "\n";
    $pos += $len;
     }
[download]

When I go ahead and try to replace that with a regular-expression based parser, I run out of memory. The code I'm replacing with is just:

while ( $sequence =~ /$motif/g ) {
        lock $semaphore;
        print $fh join("\t", $genome_build, @row, $chr, pos($sequence)
+), "\n";
        }
[download]

I should specify: when I say "run out of memory" I mean I literally get an Out of memory! message from the Perl interpreter as my script terminates prematurally. I'm very confused why the memory usage would be very different between these two. Any suggestions?

Here's the full output on termination:

Out of memory!
Callback called exit at find_and_load_motifs_test.pl line 150.
Thread 3 terminated abnormally: Callback called exit at find_and_load_
+motifs_test.pl line 150.
Perl exited with active threads:
        9 running and unjoined
        1 finished and unjoined
        0 running and detached
[download]

Here's the full code of the script, in case it helps:

### INCLUDES #########################################################
+################################
use strict;
use Bio::SeqIO;
use Carp;
use threads;
use threads::shared;

# turn off STDOUT buffering to allow smooth printing from multiple thr
+eads
$| = 1;

### PARAMETERS #######################################################
+################################
my $chr_file     = $ARGV[0];
my $motif_file   = $ARGV[1];
my $genome_build = $ARGV[2];
my $max_threads = 10;

if ( 3 != scalar(@ARGV) ) {
        croak 'Invalid parameter number';
        }
elsif ( ! -e $chr_file || ! -T $chr_file ) {
        croak 'Missing or invalid chromsome-listing file';
        }
elsif ( ! -e $motif_file || ! -T $motif_file ) {
        croak 'Missing or invalid motif-list file';
        }

my $out_file = 'results.out';

### LOCALS ###########################################################
+################################
my @chromosomes;
my %threads;
my $semaphore : shared;

### LOAD THE CHROMOSOME LIST #########################################
+################################
open(my $fh_chr, '<', $chr_file) or croak "Unable to open chromsome li
+st: $chr_file";

while (<$fh_chr>) {

        s/^\s+//;
        s/\s+$//;

        my $row = $_;
        next() if (!$row);

        push @chromosomes, $row;

        }

close($fh_chr);

### FUNCTION FOR THREADING ###########################################
+################################
sub threading_function($$$) {

        my $chromosome = $_[0];
        my $motif_file = $_[1];
        my $fh         = $_[2];

        # set up paths
        my $chr        = 'chr'.$chromosome;
        my $directory  = $chromosome.'/';
        my $fa_file    = $chr.'.fa.masked';
        my $gz_file    = $fa_file.'.gz';
        my $fa_path    = $directory.$fa_file;
        my $gz_path    = $directory.$gz_file;

        # decompress sequence file
        my $tst = `gunzip $gz_path`;

        # read sequence for motif-scanning
        my $seqio = Bio::SeqIO->new(
                -file    =>  "<$fa_path",
                -format  =>  'largefasta'
                );

        my $seq = $seqio->next_seq();
        my $sequence = $seq->seq();

        # open the motif file
        open(my $fh_motif, '<', $motif_file) or croak "Unable to open 
+motif file: $motif_file";

        # process each motif in turn
        while (<$fh_motif>) {

                s/^\s+//;
                s/\s+$//;

                my @row = split("\t");

                next() if ( 4 != scalar(@row) );

                my $motif = pop(@row);

                while ( $sequence =~ /$motif/g ) {
                        lock $semaphore;
                        print $fh join("\t", $genome_build, @row, $chr
+, pos($sequence)), "\n";
                        }

                }

        # close the motif file
        close($fh_motif);

        # recompress sequence file
        $tst = `gzip -9 $fa_path`;

        }

### THREAD MANAGEMENT ################################################
+#################################
# initially all chromosomes are waiting to be run
my @waiting = @chromosomes;

# open up the output file
open(my $fh_out, '>', $out_file) or croak 'Unable to open outfile';

# turn off buffering (make the file-handle "hot") to allow smooth prin
+ting from multiple threads
select((select($fh_out), $|=1)[0]);

# keep going while either chromosomes are waiting or threads are execu
+ting
while ( scalar(@waiting) || scalar( keys(%threads) ) ) {

        # Note: I could have used threads->list( threads::running ) to
+ count the running threads

        # first check if old threads are finished
        foreach my $chr ( keys(%threads) ) {

                # if a thread is finished join it and remove from %thr
+eads
                if ( $threads{$chr}->is_joinable() ) {
                        $threads{$chr}->join();
                        delete $threads{$chr};
                        }

                }

        # then check if we can start a new thread
        if ( ( scalar( keys(%threads) ) < $max_threads ) && ( scalar(@
+waiting) ) ) {
                my $chr = pop(@waiting);
                $threads{$chr} = threads->new(\&threading_function, $c
+hr, $motif_file, $fh_out);
                }

        # take a break if we are all full-up on threads to not waste C
+PU
        sleep 60 if ( scalar( keys(%threads) ) >= $max_threads );

        }
[download]

Comment on Memory Usage in Regex On Large Sequence Select or Download Code

Replies are listed 'Best First'.
Re: Memory Usage in Regex On Large Sequence by liverpole (Monsignor) on Sep 25, 2006 at 16:33 UTC
Hi bernanke01, My guess is that your problem may be related to the number of threads you're using. I can't tell from your program how many threads you are starting up (it looks like it's dependent on a data file), but I do know that, in situations where I've tried to create hundreds of threads, I often will get the same Out of memory! error. The reason that it's misleading is that it doesn't just depend on the number of threads, it also depends on the context, what each thread is doing, and at which time. That's why it can easily look like a separate issue. My recommendation would be to first try limiting the number of threads (unless it is already quite small), and see if the behavior changes at all. If that doesn't do it, even for a small quantity (eg. less than 20), then perhaps your problem really does lie elsewhere. s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/	[reply]
Re^2: Memory Usage in Regex On Large Sequence by bernanke01 (Beadle) on Sep 25, 2006 at 17:57 UTC
Hi, good question: it's hard-coded at 10 threads, but I've reduced it down to 3 and replicated the error. I'll set it running with just one and see what happens there.	[reply]
Re: Memory Usage in Regex On Large Sequence by BrowserUk (Patriarch) on Sep 25, 2006 at 16:50 UTC
You'll need to dave_the_m or someone similar to explain why the regex engine would consume more memory than index, even for an apparently similar, simple search, but it would not surprise me if it does. In the meantime, have you tried reducing `my $max_threads = 10;`? Did that make a difference? Other questions. Which OS are you using? Which version of Perl? If you are on Win32, could you post the output of `dumpbin /headers \path\to\your\perl.exe` [download] In particular, these lines: `1000000 size of stack reserve 1000 size of stack commit 100000 size of heap reserve 1000 size of heap commit` [download] See Use more threads. for why. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal? "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l] [select]
Re^2: Memory Usage in Regex On Large Sequence by bernanke01 (Beadle) on Sep 25, 2006 at 18:00 UTC
Hi Browser! Reducing to `$max_threads = 3` didn't affect things, but I'll try going to 1 and see what happens there. For other things, this is perl 5.8.8 on AIX 5.2, but all has been replicated on 5.8.7 on AIX 5.3 as well.	[reply] [d/l]
Re: Memory Usage in Regex On Large Sequence by dave_the_m (Monsignor) on Sep 25, 2006 at 17:29 UTC
Questions: How many threads are you creating? what sorts of strings does $motif contain? Is it anything other than A/C/G/T x N ? If so, what? Otherwise, what's the maximum length of $motif? Dave.	[reply]
Re^2: Memory Usage in Regex On Large Sequence by bernanke01 (Beadle) on Sep 25, 2006 at 18:02 UTC
Hi, It's 10 threads, but repeats at 3 as well. For now, the motifs are just 6 strings of GCAT of 5 or 6 letters each. In theory I'd like to use more complex motifs of course. So, here are my current motif-list: `GCGTG GTGCG CACGC CGCAC CACGTG GTGCAC` [download]	[reply] [d/l]
Re^3: Memory Usage in Regex On Large Sequence by dave_the_m (Monsignor) on Sep 25, 2006 at 19:38 UTC
In that case, I can't see any good reason why replacing `index(..,"GTGCAC",...)` with `/GTGCAC/` should use any more memory, except that if `$sequence` is very large, and if somewhere in the program (eg from an included module), $`, $& or $' is used, then perl has to take a complete copy of `$sequence` each time, which might put you about the threshold. Dave.	[reply] [d/l] [select]
Re^4: Memory Usage in Regex On Large Sequence by bernanke01 (Beadle) on Sep 25, 2006 at 20:16 UTC
Re^5: Memory Usage in Regex On Large Sequence by BrowserUk (Patriarch) on Sep 26, 2006 at 00:19 UTC
Some notes below your chosen depth have not been shown here
Re: Memory Usage in Regex On Large Sequence by zentara (Cardinal) on Sep 25, 2006 at 17:11 UTC
Without being able to run your code, I would also check that the Bio::SeqIO objects in each thread are thread-safe. Maybe you could substitute some generic code in it's place and see if the threads run fine with just the file operations. I'm not really a human, but I play one on earth. Cogito ergo sum a bum	[reply]