listanand has asked for the wisdom of the Perl Monks concerning the following question:

Hello Perl Monks,

I am trying to use Parallel::ForkManager to do some crawling. I am new to Perl and am having some trouble with the crawling process. Here's the relevant piece of code:

my $manager = Parallel::ForkManager->new(4);
for (@identifier) {    # list of URLs to be crawled
    $manager->start and next;
    $mech->get($_);
    die $mech->response->status_line unless $mech->success;
    my $html = $mech->content;
    # some processing of HTML to extract the location of the PDF file
    mirror($url, "/home/username/data/$file_name.pdf");
    $manager->finish;
    sleep(2);
}
$manager->wait_all_children;

When I run this program, some PDF files are retrieved, but I also see error messages like "Error GETing URL_NAME: Service Temporarily Unavailable at crawl.pl line 138". The URL_NAME is in fact accessible when I view it in a browser, yet plenty of URLs are not being crawled because of this error.

What am I missing?

Thanks in advance.

Replies are listed 'Best First'.
Re: Crawling with Parallel::ForkManager
by fullermd (Vicar) on Aug 07, 2009 at 21:53 UTC

    Just because it's there when you try to load it at a different time doesn't mean it was really available when the script ran.

    Specifically, "Service Temporarily Unavailable" suggests the server refusing your connection because it thinks you've already got enough (too many) connections open to it, which is one of the things to watch out for when you do big parallel fetches. Try reducing the amount of parallelism and see if it happens less often.

      Thanks for writing. Well, I try to access the web pages right after I stop (terminate, in this case) the program, not much later.

      You are right: when I spawn 3 child processes (I have 4 right now), I see far fewer error messages. But even if I reduce it to 2 parallel connections, I still see some errors!

      I can't think of a way out.

        It really just depends on why the server is giving you the cold shoulder. I went with the most obvious: the number of simultaneous connections. If that's the case, dropping to 1 (i.e., not parallel at all) would resolve it. But it may do rate-limiting, shoving you away after a given number of requests in a particular time period. It may be server-load dependent. It may just be flat-out random.

        Likely, the only way you can find out for sure what's up is by talking to the server admin. The best solution code-wise is to be adaptive: if you start getting errors, slow down; if you get no errors for a while, speed up. But that's a lot of work to get right.
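
        The simplest adaptive form is something like this sketch for a single (non-parallel) fetch loop: back off when a request fails, and drop back to the normal pace once requests succeed again (untested; $mech and @identifier are from the original post, and it assumes $mech was created with autocheck turned off, as the manual success check in the original code suggests):

        my $delay = 2;                      # normal pause between requests, in seconds
        URL: for my $url (@identifier) {
            for my $try (1 .. 5) {          # up to five attempts per URL
                $mech->get($url);
                if ($mech->success) {
                    # ... extract the PDF location and mirror() it as before ...
                    $delay = 2;             # no errors: back to the normal pace
                    sleep $delay;
                    next URL;
                }
                $delay *= 2;                # the server pushed back: slow down before retrying
                sleep $delay;
            }
            warn "giving up on $url: " . $mech->response->status_line . "\n";
        }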

Re: Crawling with Parallel::ForkManager
by bichonfrise74 (Vicar) on Aug 07, 2009 at 21:53 UTC
    I assume you are using Mechanize to retrieve the PDF files. Have you tried to retrieve them without using the Parallel::ForkManager module?
      Thanks for writing.

      I am using LWP::Simple (its mirror function) to retrieve the PDFs. Without Parallel::ForkManager, everything works fine.

        Just a guess here...

        Have you tried downloading the PDF through the $mech connection you already have? Something like:

        $mech->get($url_to_pdf);
        $mech->save_content( $filename );

        Maybe this is a cookie issue. I believe $mech accepts cookies by default. That would mean the separate mirror() call makes a fresh connection with no cookies, and the web server may not allow a direct request for the PDF from that page without one.

        It might work for you in the browser since your browser would already have a cookie.
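
        If cookies do turn out to be the problem, one way to keep a mirror()-style download while reusing them is to hand $mech's cookie jar to a plain LWP::UserAgent (just a sketch; $url and the file path are the placeholders from the original post):

        use LWP::UserAgent;

        # Reuse whatever cookies (and session state) $mech has already collected.
        my $ua = LWP::UserAgent->new( cookie_jar => $mech->cookie_jar );
        my $response = $ua->mirror( $url, "/home/username/data/$file_name.pdf" );

        # mirror() returns 304 Not Modified when the file on disk is already current.
        die $response->status_line
            unless $response->is_success or $response->code == 304;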