Robgunn has asked for the wisdom of the Perl Monks concerning the following question:

OK, here is the background: I'm slurping in data from a file containing protein sequences in FASTA format. I then loop through it, pull out each protein header and its sequence, generate some n-gram data, and store it in a BerkeleyDB file. In case you are wondering, I am using an XS-based module called Text::Ngram to do the n-gram generation; it returns a reference to a hash of the n-grams and their frequencies. Here is a code snippet...
use Text::Ngram qw( ngram_counts );
use ...
use ...

my $db          = "/home/db/repository/megalith.db";
my $ngram_width = 5;
my %protein_search_hash;

unlink $db;
tie %protein_search_hash, 'MLDBM',
    -Filename => $db,
    -Flags    => DB_CREATE | DB_INIT_LOCK
    or die "Cannot open database '$db': $!\n";

open FILE, "<test_data_2MB.txt" or die "Failed to load test_data: $!\n";
my $fasta_sequence = do { local $/; <FILE> };
close(FILE);

# Precompile regex for speed
my $regex   = qr/>([\S]*)\s*.*\s([A-Za-z\s]+)/;
my $regex_d = qr/^>[\S]*\s*.*\s[A-Za-z\s]+/;

# PARSE $fasta_sequence and split header from the sequence
while ( $fasta_sequence =~ m/$regex/igm ) {
    my $sequence = $2;
    my $header   = $1;

    $sequence =~ tr/[\r|\n]//d;    # delete all line breaks

    # generate n-grams
    my $ngram_href = ngram_counts($sequence, $ngram_width);

    $protein_search_hash{$header} = $ngram_href;
}

...
...
...
My problem is with memory usage: it continually gobbles RAM. Even if I comment out
$protein_search_hash{$header} = $ngram_href;
...it still eats the same amount of RAM each time. However, I have narrowed it down to
my $ngram_href= ngram_counts($sequence, $ngram_width);
...as being the problem. I've tried undef'ing the variable and moving the call into a subroutine, but nothing stops the insatiable thirst for RAM. Currently I'm limited to processing data files under 13 MB (anything larger and I'm out of RAM and into swap). If I comment out the above line, RAM usage flatlines as I would expect. Why isn't $ngram_href being overwritten on each iteration and the old value being GC'd? Monks, what am I missing here? Any help would be much appreciated!

Replies are listed 'Best First'.
Re: Memory Growth Problem
by almut (Canon) on Nov 08, 2008 at 04:16 UTC

    Presumably you have circular references.  In short: Perl's garbage collection mechanism is based on counting references to variables/objects. When some data structure is referring back to itself, the reference count doesn't drop to zero when it goes out of scope (or is being undef'ed), and the memory isn't being freed.
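    To make that concrete, here is a minimal sketch (not taken from the OP's code) of a structure that refers back to itself, plus the usual fix with Scalar::Util's weaken:

    use Scalar::Util qw(weaken);

    {
        my %node;
        $node{self} = \%node;      # the hash holds a reference to itself
    }   # refcount never drops to zero here, so the memory is not reclaimed

    {
        my %node;
        $node{self} = \%node;
        weaken( $node{self} );     # weak reference doesn't count toward the refcount,
    }   # so the hash is freed normally at scope exit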

    There are a couple of modules for analysing and getting rid of the problem, e.g. Devel::Leak, Devel::LeakTrace, Devel::Cycle, Object::Destroyer, Devel::Monitor. (The docs of the latter module contain a rather elaborate explanation of the problem, btw.)
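    As a minimal sketch of how Devel::Cycle might be pointed at a suspect structure (the cycle here is created deliberately, it is not from the OP's data):

    use Devel::Cycle;

    my %h;
    $h{loop} = \%h;        # deliberately self-referential
    find_cycle( \%h );     # prints a description of each reference cycle it finds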

Re: Memory Growth Problem
by jwkrahn (Abbot) on Nov 08, 2008 at 07:52 UTC
    my $regex = qr/>([\S]*)\s*.*\s([A-Za-z\s]+)/;
    ...
    while( $fasta_sequence =~ m/$regex/igm ) {

    Why are you putting the \S character class inside a character class?   Why use the /m option when you are not using either ^ or $ in the pattern?   Why use the /i option?

    $sequence =~ tr/[\r|\n]//d; # delete all line breaks

    Why are you also deleting the [, | and ] characters?
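    For what it's worth, a tightened version of that section might look something like this (a sketch of the suggested fixes, not the OP's actual code):

    my $regex = qr/>(\S*)\s*.*\s([A-Za-z\s]+)/;

    while ( $fasta_sequence =~ m/$regex/g ) {
        my ( $header, $sequence ) = ( $1, $2 );
        $sequence =~ tr/\r\n//d;    # delete only CR and LF
        ...
    }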

      Whoops, thanks for pointing that out.

      Update: Well, as others have pointed out, it was indeed a leak in the module I was using. I decided to roll my own subroutine (roughly along the lines of the sketch below) and my memory growth problems are solved. To my surprise, I'm still hitting my proteins-per-second target! I'm back to loving Perl.

      Thanks for all the help!
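      (That replacement subroutine wasn't posted; the sketch below is a plausible pure-Perl stand-in for ngram_counts(), assumed rather than copied from the OP's code.)

      # hypothetical stand-in for ngram_counts(); not the code the OP actually used
      sub my_ngram_counts {
          my ( $sequence, $width ) = @_;
          my %counts;
          for my $i ( 0 .. length($sequence) - $width ) {
              $counts{ substr( $sequence, $i, $width ) }++;
          }
          return \%counts;    # hashref of n-gram => frequency, like Text::Ngram returns
      }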

Re: Memory Growth Problem
by kyle (Abbot) on Nov 08, 2008 at 04:10 UTC

    Maybe ngram_counts is leaking memory.

    As long as you store $ngram_href somewhere, I wouldn't expect it to get garbage collected.

Re: Memory Growth Problem
by graff (Chancellor) on Nov 10, 2008 at 01:28 UTC
    I know next to nothing about proteins (except for the ones I like to eat), but based on other bio-related perl questions I've seen here (and what little I've read about protein sequences), I had the impression that your $sequence variable would consist of only 4 distinct letters (ACGT), so no matter how large a given sequence string happens to be, there could never be more than 4**5 (i.e. 1024) distinct 5-grams (and most sequences won't have that many).

    So running out of memory by doing n-gram counts on protein sequences would mean that you are doing lots of sequences, and a newly created hash of ngram counts is somehow being retained after each one. Since I have had occasion to use Text::Ngram, I wanted to check this carefully.

    Please let me know if the following test script somehow falls short in terms of representing your particular usage, because as it stands, it does not replicate the memory leak:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Text::Ngram qw/ngram_counts/;

    $|++;

    my @p = qw/a c g t/;
    my $test_seq = join( '', map { $p[rand @p] } 0..2047 );
    my $counter  = 0;

    while ( 1 ) {
        my $href   = ngram_counts( $test_seq );
        my $ngrams = scalar keys %$href;
        if ( ++$counter % 100 == 0 ) {
            printf( "found %4d 5-grams on iteration # %8d\r", $ngrams, $counter );
            $test_seq = join( '', map { $p[rand @p] } 0..2047 );
        }
    }
    (updated to include fixed-width numeric fields in the printf)

    No matter how long I let that run, it stays at a constant memory footprint, suggesting that Text::Ngram by itself does not have a memory leak. (I let it go over 200K iterations, which ought to be equivalent to processing about 400 MB of data.)

    You didn't indicate what your code looks like after you stopped using that module, but I'm wondering if there might have been some other factor at play in creating (and then fixing) the memory leak.

    I notice that the current version of Text::Ngram seems to date from June 2006, so you probably have that version. If you run my test script and it blows up on your machine, then there's probably something wrong with your particular installation of Text::Ngram. (I just did a fresh install on macosx with perl 5.8.8.)

    FWIW, I tried a variant of my test script, declaring an array outside the while loop and pushing the href onto the array at each iteration. The process grew to 1 GB of memory before it got to 36 K iterations. (Update: the version as posted used a constant 19 MB of RAM, about the same size as a login bash shell on my mac.)
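    That variant amounts to a small change to the script above; reconstructed from the description (not graff's actual code), it is essentially:

    my @kept;                          # declared outside the loop

    while ( 1 ) {
        my $href = ngram_counts( $test_seq );
        push @kept, $href;             # every hashref stays reachable,
                                       # so none of them can ever be freed
        ...
    }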

Re: Memory Growth Problem
by Anonymous Monk on Nov 08, 2008 at 04:17 UTC
    It sounds like there is a bug in ngram_counts.