Sorry if the code is a little long; some of you may recognize it as a modified version of a routine Merlyn wrote some time ago. The problem is that memory usage keeps climbing, and on a particularly large web site the program gets killed before it finishes extracting all the links. I have tried a few things, but frankly I am not sure where to start. Could this be a memory leak in HTML::Parser or in Perl 5.6.1? I doubt it is a true leak, because at several points the memory usage actually drops slightly.
Here is the code:
# Assumes the usual modules are loaded at the top of the script:
#   use LWP::UserAgent; use HTTP::Request; use URI; use URI::URL qw(url);
# PARSE, PING, and the ParseLink class (an HTML::Parser subclass from
# Merlyn's original) are defined elsewhere and not shown here.
sub extractLinks {
    my $ua = new LWP::UserAgent;
    my @return;
    my %skippedurl;
    my %badurl;
    my %goodurl;
    my %links;
    my $badurls  = 0;
    my $goodurls = 0;
    my $thisurl;
    my %did;
    my $response;
    my $request;
    my $method;
    my $startPoint = shift || die "__FILE__: No starting point set.\n";
    my $debug      = shift;

    push(@CHECK, "$startPoint");
    $domain = new URI($startPoint);
    if ($startPoint =~ /^.*tp/) {
        $domain = $domain->host;
    }
    else {
        return 0;
    }
    $ua->agent("LinkExtractor/1.0");
    $ua->env_proxy;
    $| = 1;

    {
      MAINLOOP:
        while ($thisurl = shift @CHECK) {
            my $testnum = scalar(@CHECK);
            print "CHECK: $testnum\n";
            if ($debug) { print "processing $thisurl, $domain\n"; }
            $thisurl =~ s/%7e/~/ig;    # convert encoded tilde to usable tilde
            next if $did{$thisurl}++;
            if (PARSE($thisurl, $domain)) {
                $goodurl{$thisurl}++;
                if ($debug) { print "GOODURL: $thisurl\n"; }
                if ($debug) { warn "fetching $thisurl\n"; }
                $request  = new HTTP::Request('GET', $thisurl);
                $response = $ua->request($request);    # fetch!
                unless ($response->is_success) {
                    $badurl{$thisurl} =
                      "status " . $response->code . " " . $response->message;
                    $badurls = 1;
                    if ($debug) {
                        warn "Cannot fetch $thisurl (status ",
                          $response->code, " ", $response->message, ")\n";
                    }
                    next MAINLOOP;
                }
                unless ($response->content_type =~ /text\/html/i) {
                    next MAINLOOP;
                }
                $base = $response->base;
                my $p = ParseLink->new;
                $p->parse($response->content);
                # parse() wants a final undef, but that triggered a warning,
                # so an empty string is passed instead to silence it
                $p->parse("");
                for $link ($p->get_links) {
                    $abs = url($link, $base)->abs;
                    if ($debug) { warn "... $link => $abs\n"; }
                    if (PARSE($abs, $domain)) {
                        print "pushing: $abs\n";
                        push(@CHECK, $abs);
                        push(@{ $links{$thisurl} }, $abs);
                    }
                }
                next MAINLOOP;    # done with this page; fall through only for non-parsed URLs
            }
            if (PING $thisurl) {
                if ($debug) { warn "verifying $thisurl\n"; }
                for $method (qw(HEAD GET)) {
                    $request  = new HTTP::Request($method, $thisurl);
                    $response = $ua->request($request);    # fetch!
                    if ($response->is_success) {
                        next MAINLOOP;
                    }
                }
                if ($debug) {
                    warn "Cannot fetch $thisurl (status ",
                      $response->code, " ", $response->message, ")\n";
                }
                $badurl{$thisurl} =
                  "status " . $response->code . " " . $response->message;
                $badurls = 1;
                next MAINLOOP;
            }
            if ($debug) { warn "[skipping $thisurl]\n"; }
            $skippedurl{$thisurl} = 1;
        }
        push(@return, \%links);
        push(@return, \%goodurl);
        push(@return, \%badurl);
        return \@return;
    }
}
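One thing I have been wondering about (and maybe someone can confirm or shoot this down) is whether the @CHECK queue itself is where the memory goes: a URL gets pushed onto @CHECK once for every page that links to it, and the %did duplicate check only happens after it has been shifted off, so on a heavily cross-linked site the queue could grow enormously. Below is an untested sketch of the change I had in mind for the inner loop, checking a %seen hash at push time. %seen is just a name I made up here, not anything from Merlyn's original, and it would have to be declared up where %did is.

    # Untested sketch: still record every link in %links, but only queue
    # each absolute URL for fetching once, so @CHECK holds a URL at most
    # once instead of once per page that links to it.
    # (Requires "my %seen;" declared alongside "my %did;" above.)
    for $link ($p->get_links) {
        $abs = url($link, $base)->abs;
        if ($debug) { warn "... $link => $abs\n"; }
        if (PARSE($abs, $domain)) {
            push(@{ $links{$thisurl} }, $abs);        # keep the link map complete
            push(@CHECK, $abs) unless $seen{"$abs"}++; # but queue it only once
        }
    }

I have not measured whether this actually accounts for the growth I am seeing, so I would appreciate any pointers on whether this is even the right place to look.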