Sorry if the code is a little long; some of you may recognize it as a modified version of a routine Merlyn wrote some time ago. The problem is that memory usage keeps climbing, and on a particularly large web site the program gets killed before it finishes extracting all the links. I have tried a few things, but frankly I am not sure where to start. Could this be a memory leak in HTML::Parser or in Perl 5.6.1? I doubt it is a true leak, because at several points the memory usage actually drops slightly.
Here is the code:
# Assumes the usual modules are loaded at the top of the script:
#   use LWP::UserAgent; use HTTP::Request; use URI; use URI::URL qw(url);
# PARSE, PING, and the ParseLink class (an HTML::Parser subclass from
# Merlyn's original) are defined elsewhere and not shown here.
sub extractLinks {
    my $ua = new LWP::UserAgent;
    my @return;
    my %skippedurl;
    my %badurl;
    my %goodurl;
    my %links;
    my $badurls  = 0;
    my $goodurls = 0;
    my $thisurl;
    my %did;
    my $response;
    my $request;
    my $method;
    my $startPoint = shift || die "__FILE__: No starting point set.\n";
    my $debug      = shift;

    push(@CHECK, "$startPoint");
    $domain = new URI($startPoint);
    if ($startPoint =~ /^.*tp/) {
        $domain = $domain->host;
    }
    else {
        return 0;
    }
    $ua->agent("LinkExtractor/1.0");
    $ua->env_proxy;
    $| = 1;

    {
      MAINLOOP:
        while ($thisurl = shift @CHECK) {
            my $testnum = scalar(@CHECK);
            print "CHECK: $testnum\n";
            if ($debug) { print "processing $thisurl, $domain\n"; }
            $thisurl =~ s/%7e/~/ig;    # convert encoded tilde to usable tilde
            next if $did{$thisurl}++;
            if (PARSE($thisurl, $domain)) {
                $goodurl{$thisurl}++;
                if ($debug) { print "GOODURL: $thisurl\n"; }
                if ($debug) { warn "fetching $thisurl\n"; }
                $request  = new HTTP::Request('GET', $thisurl);
                $response = $ua->request($request);    # fetch!
                unless ($response->is_success) {
                    $badurl{$thisurl} =
                      "status " . $response->code . " " . $response->message;
                    $badurls = 1;
                    if ($debug) {
                        warn "Cannot fetch $thisurl (status ",
                          $response->code, " ", $response->message, ")\n";
                    }
                    next MAINLOOP;
                }
                unless ($response->content_type =~ /text\/html/i) {
                    next MAINLOOP;
                }
                $base = $response->base;
                my $p = ParseLink->new;
                $p->parse($response->content);
                # parse() wants a final undef, but that triggered a warning,
                # so an empty string is passed instead to silence it
                $p->parse("");
                for $link ($p->get_links) {
                    $abs = url($link, $base)->abs;
                    if ($debug) { warn "... $link => $abs\n"; }
                    if (PARSE($abs, $domain)) {
                        print "pushing: $abs\n";
                        push(@CHECK, $abs);
                        push(@{ $links{$thisurl} }, $abs);
                    }
                }
                next MAINLOOP;    # done with this page; fall through only for non-parsed URLs
            }
            if (PING $thisurl) {
                if ($debug) { warn "verifying $thisurl\n"; }
                for $method (qw(HEAD GET)) {
                    $request  = new HTTP::Request($method, $thisurl);
                    $response = $ua->request($request);    # fetch!
                    if ($response->is_success) {
                        next MAINLOOP;
                    }
                }
                if ($debug) {
                    warn "Cannot fetch $thisurl (status ",
                      $response->code, " ", $response->message, ")\n";
                }
                $badurl{$thisurl} =
                  "status " . $response->code . " " . $response->message;
                $badurls = 1;
                next MAINLOOP;
            }
            if ($debug) { warn "[skipping $thisurl]\n"; }
            $skippedurl{$thisurl} = 1;
        }
        push(@return, \%links);
        push(@return, \%goodurl);
        push(@return, \%badurl);
        return \@return;
    }
}
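One thing I have been wondering about (and maybe someone can confirm or shoot this down) is whether the @CHECK queue itself is where the memory goes: a URL gets pushed onto @CHECK once for every page that links to it, and the %did duplicate check only happens after it has been shifted off, so on a heavily cross-linked site the queue could grow enormously. Below is an untested sketch of the change I had in mind for the inner loop, checking a %seen hash at push time. %seen is just a name I made up here, not anything from Merlyn's original, and it would have to be declared up where %did is.

    # Untested sketch: still record every link in %links, but only queue
    # each absolute URL for fetching once, so @CHECK holds a URL at most
    # once instead of once per page that links to it.
    # (Requires "my %seen;" declared alongside "my %did;" above.)
    for $link ($p->get_links) {
        $abs = url($link, $base)->abs;
        if ($debug) { warn "... $link => $abs\n"; }
        if (PARSE($abs, $domain)) {
            push(@{ $links{$thisurl} }, $abs);        # keep the link map complete
            push(@CHECK, $abs) unless $seen{"$abs"}++; # but queue it only once
        }
    }

I have not measured whether this actually accounts for the growth I am seeing, so I would appreciate any pointers on whether this is even the right place to look.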