mr.nick has asked for the wisdom of the Perl Monks concerning the following question:

Okay ---

After struggling to reduce the memory usage of a statistics program of mine, and after blaming everything from tie to NDBM_File to DBI, I've finally realized something: my program consumes vast quantities of memory without actually storing anything.

What do I mean? Well, look at the following code and tell me: why does it grow to hundreds of megabytes of memory? I'm not storing any data any place.

#!/usr/bin/perl

use strict;
use CGI;
use URI::Escape;

##############################################################################
#
#
##############################################################################
if ($ARGV[0]=~/\.gz$/i) {
  open STDIN,"zcat $ARGV[0] |";
  shift @ARGV;
}

##############################################################################
##
sub breakquery {
  my $sz=shift;
  my %res;

  $sz="\L$sz";
  $sz=~s/\.[a-z]{2}\.//g;
  $sz=~s/\@[a-z]{2}.\d+//g;

  while ($sz=~s/[\"\']([^\"\']+?)[\"\']//) {
    $res{$1}++;
  }

  $sz=~s/\s{2}/ /g;
  $sz=~s/[\+\'\"\$\(\)]//g;

  if ($sz) {
    my @terms=split /[\s,]/,$sz;
    for my $t (@terms) {
      $t=~s/^\s+//;
      $t=~s/\s+$//;
      next if $t=~/^\s*$/
           || $t=~/^..{0,1}$/
           || $t!~/^[a-z0-9\-]+$/
           || $t=~/^[0-9]+$/;
      $res{$t}++ unless grep /^$t$/,qw( and not or adj of the for with );
    }
  }

  sort keys %res;
}

##############################################################################
while (<>) {
  next unless m{GET /netacgi/nph-brs\?([^\s]+)};

  my $cgi=new CGI($1);
  next unless defined $cgi;

  my $db=$cgi->param("d");
  next if $db=~/^\s*$/;
  $db="\U$db";
  next if grep /^$db$/,qw( CHNH CHCA );
  next if length($db)!=4;

  my $s4=$cgi->param("s4");
  next unless defined $s4;
  next if $s4=~/^\s*$/;

  my @terms=breakquery $s4;

  print STDERR "\r";
  printf STDERR "%.100s","$db ".join(" ",@terms);    # $db $t
}
Do you see anywhere that data is being stored beyond the one-line-at-a-time level? There are no globals, and certainly no variables that live outside the while (<>) loop or the subroutine.

So why does this program grow in size when run? Like I said, I'm not accumulating data :( Within 26 minutes or so of running, it exceeds 200MB of memory; after 50 minutes, it consumes 400MB.

Am I missing something really basic here?
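
One thing I've been doing to narrow it down is running a suspect piece of the loop by itself and watching the process size. Here's a minimal harness along those lines (the query string is made up, and the ps invocation assumes a Unix-ish system):

#!/usr/bin/perl
use strict;
use CGI;

# Build a fresh CGI object from a constant string every iteration,
# just as the main loop does, with all the other work stripped away.
for my $i (1 .. 1_000_000) {
    my $cgi = new CGI("d=CHCP&l=20&s4=underserved");
    my $db  = $cgi->param("d");

    # Report the resident set size (in KB) every 100,000 iterations.
    if ($i % 100_000 == 0) {
        my $rss = `ps -o rss= -p $$`;
        chomp $rss;
        $rss =~ s/^\s+//;
        print STDERR "$i iterations: rss=$rss KB\n";
    }
}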

Btw, sample input data looks like:

anx57-105.dialup.emory.edu - - [01/Mar/2001:00:00:21 -0500] "GET /detail/detail.html HTTP/1.0" 200 19308 "http://chid.nih.gov/netacgi/nph-brs?op4=and&op5=and&op6=and&op7=and&op8=and&op9=and&op10=and&d=CHCP&l=20&Sect1=CINK&co3=and&pg4=all&s4=underserved&co4=and&pg5=mj&s5=cervical+cancer&co5=and&pg6=de&s6=&co6=and&pg7=au,cn&s7=&co7=and&pg8=ti&s8=&co8=and&pg9=ac&s9=&co9=and&pg10=so,av&s10=&s1=@YR%3E=1995+or+199X.&co1=and&s3=&co2=and&s2=&Sect2=IMAGE&Sect3=THESOFF&Sect3=PLUROFF&Sect4=HITOFF&p=1&u=/detail/detail.html&r=8&f=G" "Mozilla/4.73 [en] (Win95; U)"

Replies are listed 'Best First'.
Re: Memory Leaks
by mr.nick (Chaplain) on Apr 11, 2001 at 23:34 UTC
    I suppose I should comment here:

    The problem turned out to be CGI.pm not cleaning up after itself when I used the $cgi=new CGI($1) line (which initializes the $cgi object from the query string in $1).

    At the end of each pass through the while loop, the CGI object was never properly released. My final solution was to scrap CGI.pm entirely and write my own parameter extractor, since I only need the "d" and "s4" parameters; a sketch of the idea follows below.

    Easy enough, and orders of magnitude faster.
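
    Here is a minimal sketch of that kind of extractor, assuming URI::Escape (which the original script already loads) and assuming neither "d" nor "s4" ever appears more than once per query string; extract_params is just a name I picked for illustration:

    use strict;
    use URI::Escape;

    # Pull only the named parameters out of a raw query string.
    # No CGI object is built, so nothing can persist between log lines.
    sub extract_params {
        my ($query, @wanted) = @_;
        my %want = map { $_ => 1 } @wanted;
        my %out;
        for my $pair (split /[&;]/, $query) {
            my ($k, $v) = split /=/, $pair, 2;
            next unless defined $k && $want{$k};
            $v = '' unless defined $v;
            $v =~ tr/+/ /;              # '+' encodes a space in form data
            $out{$k} = uri_unescape($v);
        }
        return %out;
    }

    # In the main loop, instead of my $cgi = new CGI($1):
    # my %p = extract_params($1, "d", "s4");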


    And some valid test data:

    126.san-juan-01-02rs.pr.dial-access.att.net - - [01/Mar/2001:00:00:57 -0500] "GET /netacgi/nph-brs?d=NKKU&op4=and&s4=%22chronic+renal+insufficiecy%22&l=20&Sect1=LINK&Sect2=IMAGE&Sect3=HITOFF&p=1&u=%2Fsimple%2Fsimple.html&r=0&f=S HTTP/1.1" 200 - "http://chid.nih.gov/simple/simple.html" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; AT&T WNS IE4.0)"
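
    For instance, fed the test line above, extract_params would hand back d => "NKKU" and s4 => '"chronic renal insufficiecy"' (the embedded quotes come from the %22s, and the spelling is straight from the original query), which is everything the main loop needs.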