in reply to Memory Growth Problem

I know next to nothing about proteins (except for the ones I like to eat), but based on other bio-related Perl questions I've seen here (and what little I've read about protein sequences), I had the impression that your $sequence variable would consist of only 4 distinct letters (ACGT). If so, then no matter how large a given sequence string happens to be, there could never be more than 4**5 (i.e. 1024) distinct 5-grams, and most sequences won't have even that many.
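As a quick sanity check on that arithmetic (my own sketch, not anything from your code), counting the distinct 5-grams of a long random ACGT string by hand shows the hash can never hold more than 4**5 = 1024 keys, no matter how long the string gets:

#!/usr/bin/perl
use strict;
use warnings;

# Build a long random sequence over a 4-letter alphabet and count its
# distinct 5-grams by hand; the key count is bounded by 4**5 = 1024.
my @p   = qw/a c g t/;
my $seq = join '', map { $p[rand @p] } 1 .. 100_000;

my %seen;
for my $i ( 0 .. length($seq) - 5 ) {
    $seen{ substr( $seq, $i, 5 ) }++;
}
printf "distinct 5-grams: %d (max possible: %d)\n", scalar keys %seen, 4**5;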

So running out of memory while doing n-gram counts on protein sequences would mean that you are processing lots of sequences, and that a newly created hash of n-gram counts is somehow being retained after each one. Since I have had occasion to use Text::Ngram myself, I wanted to check this carefully.

Please let me know if the following test script somehow falls short in terms of representing your particular usage, because as it stands, it does not replicate the memory leak:

#!/usr/bin/perl
use strict;
use warnings;
use Text::Ngram qw/ngram_counts/;

$|++;    # unbuffered output, so the \r progress line updates in place

my @p        = qw/a c g t/;
my $test_seq = join( '', map { $p[rand @p] } 0 .. 2047 );
my $counter  = 0;

while ( 1 ) {
    # window size of 5, to match the 5-gram counts discussed above
    my $href   = ngram_counts( $test_seq, 5 );
    my $ngrams = scalar keys %$href;
    if ( ++$counter % 100 == 0 ) {
        printf( "found %4d 5-grams on iteration # %8d\r", $ngrams, $counter );
        # generate a fresh random sequence every 100 iterations
        $test_seq = join( '', map { $p[rand @p] } 0 .. 2047 );
    }
}
(updated to include fixed-width numeric fields in the printf)

No matter how long I let that run, it stays at a constant memory footprint, suggesting that Text::Ngram by itself does not have a memory leak. (I let it go over 200K iterations, which ought to be equivalent to processing about 400 MB of data.)

You didn't indicate what your code looks like after you stopped using that module, but I'm wondering if there might have been some other factor at play in creating (and then fixing) the memory leak.

I notice that the current version of Text::Ngram seems to date from June 2006, so you probably have that version. If you run my test script and it blows up on your machine, then there's probably something wrong with your particular installation of Text::Ngram. (I just did a fresh install on Mac OS X with perl 5.8.8.)

FWIW, I tried a variant of my test script, declaring an array outside the while loop and pushing the href onto the array at each iteration. That process grew to 1 GB of memory before it reached 36K iterations. (Update: the version as posted used a constant 19 MB of RAM, about the same size as a login bash shell on my Mac.)
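For reference, that leaky variant looked roughly like this (a sketch of the pattern, not the exact code I ran; the window size of 5 is carried over from the main script). Keeping a reference to every returned hash means Perl can never free them, so memory grows without bound:

#!/usr/bin/perl
use strict;
use warnings;
use Text::Ngram qw/ngram_counts/;

my @p        = qw/a c g t/;
my $test_seq = join( '', map { $p[rand @p] } 0 .. 2047 );

my @keep;    # declared outside the loop -- this is what makes the process grow
while ( 1 ) {
    my $href = ngram_counts( $test_seq, 5 );
    push @keep, $href;    # every hash of counts stays reachable, so none are freed
}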