listanand has asked for the wisdom of the Perl Monks concerning the following question:
I am trying to use Parallel::ForkManager to do some crawling. I am new to Perl and am having some trouble with the crawling process. Here's the relevant piece of code:
my $manager = Parallel::ForkManager->new(4);
for (@identifier) {                  # list of URLs to be crawled
    $manager->start and next;
    $mech->get($_);
    die $mech->response->status_line unless $mech->success;
    my $html = $mech->content;
    ## some processing of HTML to extract the location of the PDF file ##
    mirror($url, "/home/username/data/$file_name.pdf");
    $manager->finish;
    sleep(2);
}
$manager->wait_all_children;
When I run this program, some PDF files are retrieved, but I also see error messages like "Error GETing URL_NAME: Service Temporarily Unavailable at crawl.pl line 138". Yet the same URL_NAME is accessible when I view it in a browser. Plenty of URLs are not being crawled because of this.
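In case it helps, here is a rough sketch of the kind of change I have been considering, assuming the 503 responses just mean the server is throttling several simultaneous requests. The worker count, retry limit, 10-second delay, and the commented-out mirror() call are placeholders for illustration, not my actual extraction code:

use strict;
use warnings;
use WWW::Mechanize;
use LWP::Simple qw(mirror);
use Parallel::ForkManager;

my @identifier = ();                            # list of URLs to crawl (placeholder)
my $manager = Parallel::ForkManager->new(2);    # fewer workers is gentler on the server

for my $url (@identifier) {
    $manager->start and next;

    my $mech  = WWW::Mechanize->new(autocheck => 0);   # each child builds its own UA
    my $tries = 0;
    until ($mech->get($url) and $mech->success) {
        # give up after three failed attempts, otherwise wait and retry
        die $mech->response->status_line if ++$tries >= 3;
        sleep 10;
    }

    my $html = $mech->content;
    ## ... extract the PDF location from $html here, as in the original script ...
    ## mirror($pdf_url, "/home/username/data/$file_name.pdf");

    $manager->finish;   # finish() exits the child, so nothing placed after it would run
}
$manager->wait_all_children;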
What am I missing?
Thanks in advance.
Replies are listed 'Best First'.
Re: Crawling with Parallel::ForkManager
by fullermd (Vicar) on Aug 07, 2009 at 21:53 UTC
by listanand (Sexton) on Aug 07, 2009 at 22:28 UTC
by fullermd (Vicar) on Aug 07, 2009 at 22:44 UTC

Re: Crawling with Parallel::ForkManager
by bichonfrise74 (Vicar) on Aug 07, 2009 at 21:53 UTC
by listanand (Sexton) on Aug 07, 2009 at 22:34 UTC
by tokpela (Chaplain) on Aug 08, 2009 at 08:55 UTC