Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

So I wrote a very simple DMOZ parser and downloader. I'm running through an array of 4380 URLs: each one gets downloaded, the HTML stripped, then the script discards the site and moves on. It's a proof of concept to see how long it will take to download a massive number of sites using LWP and a couple of other modules. After about 2200 URLs I get a segfault. This is the code:
#!/usr/bin/perl
require LWP::UserAgent;
use HTML::FormatText;
use Time::HiRes;
require HTML::TreeBuilder;

print "Opening fetch file...\n";

# Open the DMOZ file
open(tmp, "Top/Arts/Movies/domains");
@domains = <tmp>;
close(tmp);

# set variables used to output later
$count    = 0;
$interval = 1;

foreach (@domains) {
    $count++;
    $domain = "http://" . $_;
    $start  = Time::HiRes::time();

    my $ua = LWP::UserAgent->new;
    $ua->timeout(10);
    $ua->env_proxy;

    my $response = $ua->get($domain);
    if ($response->is_success) {
        $content = $response->content;    # or whatever
    }
    else {
        $response->status_line;
    }

    $tree      = HTML::TreeBuilder->new->parse_file("file.html");
    $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 50);

    $end    = Time::HiRes::time();
    $time   = $end - $start;
    $total += $time;
    if ($total >= (10 * $interval)) {
        print "$count sites in $total seconds\n";
        $interval++;
    }
}
Any suggestions as to why this is segfaulting?

Replies are listed 'Best First'.
Re: Segment Fault(Core Dump)
by hv (Prior) on May 17, 2004 at 16:30 UTC

    Other respondents have suggested that you may be running out of memory, but that shouldn't normally cause a coredump. Coredumps tend to be quite specific to platform and perl version, so a useful starting point would be to show the output of perl -V.

    Of the modules you are using, the most system-sensitive is actually Time::HiRes, so it might be worth checking that you have the latest version installed.
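
    For instance, a quick one-liner to see which version of Time::HiRes you have installed:

        perl -MTime::HiRes -le 'print $Time::HiRes::VERSION'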

    Since you presumably have a core file, the next useful thing to do would be to show it to a C-level debugger and ask for a backtrace. However, the usefulness of the information may be limited if your perl isn't built with debugging symbols; in that case it might be worth building a perl with debugging enabled and trying again.
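
    For example, assuming gdb is available and the core file is named "core" (adjust the path to your perl binary for your system):

        gdb /usr/bin/perl core      # load the interpreter together with the core file
        (gdb) bt                    # print the C-level backtrace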

    Hugo

Re: Segment Fault(Core Dump)
by zentara (Cardinal) on May 17, 2004 at 14:14 UTC
    I haven't run your code, but from my experience with Tk programs I'll hazard a guess.

    You are creating a new $ua, $tree, and $formatter object for each cycle through the loop, so maybe you are running out of memory? Have you monitored the memory usage as it runs? Is there a way for you to create the $ua, $tree, and $formatter only once before you start the loop, then reuse them as you loop through the data?

    Maybe try undef'ing the objects at the end of the loop, as a start.
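
    A rough sketch of the reuse idea (untested; the parsing and timing code is elided): one LWP::UserAgent can safely serve every request, so there's no need to rebuild it per URL.

        use LWP::UserAgent;

        # Create the agent once, outside the loop, and reuse it.
        my $ua = LWP::UserAgent->new;
        $ua->timeout(10);
        $ua->env_proxy;

        foreach my $domain (@domains) {
            chomp $domain;
            my $response = $ua->get("http://$domain");
            next unless $response->is_success;
            # ... parse, format, and time as before ...
        }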


    I'm not really a human, but I play one on earth. flash japh

      Actually, since he's reassigning to the same $tree et al. each time through, the old ones (if any) should be getting deallocated. The problem is that HTML::TreeBuilder objects contain circular references and don't get garbage-collected correctly. This is why the perldocs say you're supposed to call $tree->delete when you're done with it.
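
      In other words, the end of each loop pass should look something like this (a sketch following the OP's variable names):

          # HTML::Element nodes hold parent <-> child references, so
          # reference counting alone never frees a finished tree;
          # delete() breaks the cycles explicitly.
          my $tree      = HTML::TreeBuilder->new->parse_file("file.html");
          my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 50);
          my $text      = $formatter->format($tree);
          $tree->delete;    # without this, the trees pile up on every pass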

        Thanks everyone for replying. I'll try using "undef" and $tree->delete at the end of the main loop to clear the garbage up. I'm surprised I didn't think of this sooner. Cheers! Anonymous Monk
      I don't think this is a Tk program; I see no Tk modules here.

      Any chance the OP could get dbx to produce a stack trace from the original core dump? That would help identify where the problem is.

      --
      I'm Not Just Another Perl Hacker