cav has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,
I study bioinformatics, and presently I wrote a parser to study the genome of T. Cruzi. What this simply does, is to open a 1GB file, and go line by line through all the "words" of the file and write down their location and number of insertion in a mysql database.

The problem is that while the program runs, the ram keeps going up, when it shouldnt. Since I expect this program to run during 7 days, before obtaining my results, what should I do? Below follows the code where the RAM keeps pumping up.

Thanks
Paulo Carvalho
LONG:while (<READ>) { $jump ++; if ($jump <= $rows2jump){ $counter2++; next LONG; } $counter2++; #number of lines processed #Lets deal with the sequences head here if ($_ =~ /^>/) { $h_title = "$_"; $h_title =~ s/^>//; #Check if head exists #If head doesnt exist, insert it in the <br>head index + table $sth1->execute($h_title) or <br>print "$DBI::errstr"; #Get last id $h_title_counter = $dbh->selectrow_array("SELECT MAX(i +d) FROM head_index"); next LONG; } #Here we will deal with the actal sequence chop; #To remove the > at the end of the line my $line = "$_"; #Input stuff in mysql for my $x (0 ... (length($line) - ($word_length)-1)) { my $word = substr($line, $x, $word_length); <br> #Now we have it with a big cache, hope to speed things + up if ($most_used_words{$word}) { #UPDATE e_match_word set counter = counter+1 WHERE + word = ? $sth3->execute($word) or print "$DBI::errstr"; #print "cached result\n"; } elsif ($dbh->selectrow_array("SELECT 1 FROM e_match_ +word WHERE word = ?", undef, $word)) { #UPDATE e_match_word set counter = counter+1 WHERE + word = ? $sth3->execute($word) or print "$DBI::errstr"; #print "not cached result\n"; } else { #INSERT INTO e_match_word VALUES (?,?) $sth2->execute($word, 1) or print "$DBI::errstr"; $wordcounter++; #print "new word\n"; } #INSERT INTO e_match_info values (?,?,?) $sth4->execute($word, $h_title_counter, $x) or print " +$DBI::errstr"; } #$percent_done = int(($counter2/$counter) * 100); $status = "Searching for exact patterns.. processing line +$counter2 of $counter"; $mw->update(); $mw->update; exit if ($stop); } close (READ); }

Replies are listed 'Best First'.
Re: Memory Overflow
by tachyon (Chancellor) on Jan 05, 2004 at 23:50 UTC

    I suspect that the $dbh->fetchrow_array() statements may be causing you issue, also you had a very significant syntax error that would have led to $h_title_counter ALWAYS being set to 1. Vis:

    my @array = ( 'single result value' ); my $wrong = @array; # SCALAR CONTEXT my ($right) = @array; # ARRAY CONTEXT print "Wrong: $wrong\nRight: $right\n"; __DATA__ Wrong: 1 Right: single result value

    I have cleaned up the code a little, moved stuff out of the loop that does not need to be there and added a bind case and dropped the $line declaration and just use $_ as you only use it once, and Perl has lazy GC. $jump will always equal $. (line count - so that is one less var)

    # we don't need this in the loop as we know when we increment (unles +s forking) # also syntax IS WRONG in original - need array context on LHS to ge +t val # otherwise you are always going to get the val 1 ($h_title_counter) = $dbh->selectrow_array("SELECT MAX(id) FROM head +_index"); # this should be a sth with bind not $dbh->selectrow_array every tim +e my $sthX = $dbh->prepare("SELECT 1 FROM e_match_word WHERE word = ?" +); LONG: while (<READ>) { next if $. <= $rows2jump; $counter2++; #number of lines processed #Lets deal with the sequences head here, marked by > so th +is will # only exec if we have that marker if ( s/^>// ) { $sth1->execute($_) or print "$DBI::errstr"; $h_title_counter++; next LONG; } #Here we will deal with the actal sequence chop; #To remove the > at the end of the line #Input stuff in mysql for my $x (0 ... (length($line) - ($word_length)-1)) { my $word = substr($_, $x, $word_length); #Now we have it with a big cache, hope to speed things + up if ($most_used_words{$word}) { #UPDATE e_match_word set counter = counter+1 WHERE + word = ? $sth3->execute($word) or print "$DBI::errstr"; #print "cached result\n"; } elsif ( $sthX->execute($word)) { #UPDATE e_match_word set counter = counter+1 WHERE + word = ? $sth3->execute($word) or print "$DBI::errstr"; #print "not cached result\n"; } else { #INSERT INTO e_match_word VALUES (?,?) $sth2->execute($word, 1) or print "$DBI::errstr"; $wordcounter++; #print "new word\n"; } #INSERT INTO e_match_info values (?,?,?) $sth4->execute($word, $h_title_counter, $x) or print " +$DBI::errstr"; } #$percent_done = int(($counter2/$counter) * 100); # you are setting status but never use it...... #$status = "Searching for exact patterns.. processing line + $counter2 of $counter"; $mw->update(); # $mw->update; <--- duplicate # exit if ($stop); <--- never set in loop } close (READ);

    Now the other issue you have is runtime. If you are say using MySQL with default settings you may error out with a MySQL server has gone away error unless you are using all you prepared STHs regularly ie within timeout (8 hours default). Probably not an issue but it would be a bit annoying if this died 6.9 days into a run.

    Finally I can save you another database query. You can do this (pseudo code)

    # currently if ( DB query word exists in table ) { DB do update counter } else DB insert word with counter val of 1 } # this will always use 2 DB queries per iter, whereas this code..... unless ( execute(UPDATE table SET counter = counter + 1 WHERE word = ? +) ) { # this should only fail if the bound word does not exist so # know we know we really needed an INSERT with default value (1) # but we have saved 1 DB query - IN A LOOP execute(INSERT INTO table VALUES ( word, 1 ) ) } # will only use one query if it is an update (probably the most common + case) # and still only uses 2 for an insert. # this can yield savings -> 50% depending on Update:Insert ratio

    The logic is that if we need the insert the update will fail. If not the update succeeds. Thus we save directly verifying if 'word' is in the table or not.

    cheers

    tachyon

      my ($right) = @array; # ARRAY CONTEXT ?

      I think you mean LIST CONTEXT.

        I see your quible and raise you.....

        cheers

        tachyon

Re: Memory Overflow
by edan (Curate) on Jan 06, 2004 at 07:19 UTC

    A long shot, but...

    What version of DBD::mysql are you using? If you don't know how to check, do this at your nearest command prompt:

    perl -MDBD::mysql -le'print $DBD::mysql::VERSION'

    I recently had a problem with a memory leak, which was traced to DBD::mysql version 2.9002. I installed 2.9003 and the problem went away.

    --
    3dan

Re: Memory Overflow
by Joost (Canon) on Jan 06, 2004 at 00:02 UTC