Ntav has asked for the wisdom of the Perl Monks concerning the following question:

I'm doing what I guess is a fairly common task with Perl: retrieving a number of data sources from the web, then parsing, extracting and analysing the data. My question relates to the retrieval step; I use something like this:
use LWP::Simple;

# each page is actually processed in a subroutine but you get the idea
$page1 = "http://www.first.com";
$page1 = get($page1);

# rest of the pages in same format here
$pageN = "http://www.last.com";
$pageN = get($pageN);
Now this has (at least) two problems which I need to solve:

1. Each of the pages is retrieved in turn, whereas given the speed of the server the script runs on I want to get them all at once. Q1: how do I implement multithreading here?

2. If a page fails to respond I don't want the script to wait more than N seconds before moving on to the next one. Q2: how do I time a (sub)process and kill it after N seconds?

Thanks for any help,
Ntav

Re: retrieving multiple web documents
by wog (Curate) on Aug 30, 2001 at 06:09 UTC

      POE::Component::Client::HTTP looks interesting, too, although when I had a quick look at it the documentation wasn't too great so I never got any code working. POE looks like an interesting tool for dealing with a variety of parallel tasks in Perl, though.
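      From a quick skim of its docs, the rough shape seems to be something like this (completely untested on my end, and the URLs, the 'ua' alias and the 10-second timeout are just placeholders):

      use strict;
      use warnings;
      use POE qw(Component::Client::HTTP);
      use HTTP::Request::Common qw(GET);

      # spawn one HTTP client component; Timeout covers the "give up after N seconds" part
      POE::Component::Client::HTTP->spawn(
          Alias   => 'ua',
          Timeout => 10,
      );

      POE::Session->create(
          inline_states => {
              _start => sub {
                  my $kernel = $_[KERNEL];
                  # queue every request up front; the component fetches them concurrently
                  for my $url ('http://www.first.com', 'http://www.last.com') {
                      $kernel->post( 'ua', 'request', 'got_response', GET($url) );
                  }
              },
              got_response => sub {
                  my ($request_packet, $response_packet) = @_[ARG0, ARG1];
                  my $request  = $request_packet->[0];
                  my $response = $response_packet->[0];
                  print $request->uri, ": ", $response->status_line, "\n";
              },
          },
      );

      POE::Kernel->run;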

      Ntav's code might benefit from using an array of URLs instead of a series of scalars. For example:

      use LWP::Simple;

      my @page = qw(http://www.first.com http://www.last.com);
      foreach my $url (@page) {
          my $content = get($url);
          # ... parse/extract from $content here ...
      }

Re: retrieving multiple web documents
by Zaxo (Archbishop) on Aug 30, 2001 at 06:18 UTC

    Just recently, Parallel::ForkManager was reviewed here. Its pod includes a snippet for doing just what you want.
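    A rough sketch of what that could look like, combining it with an LWP::UserAgent timeout to cover the second question (untested; the URLs, the five parallel children and the 10-second timeout are only placeholders):

    use strict;
    use warnings;
    use LWP::UserAgent;
    use Parallel::ForkManager;

    my @urls = qw(http://www.first.com http://www.last.com);

    # fetch up to 5 pages at once, each in its own child process
    my $pm = Parallel::ForkManager->new(5);

    foreach my $url (@urls) {
        $pm->start and next;    # parent: fork a child, then move to the next URL

        # child: fetch one page, giving up after 10 seconds
        my $ua = LWP::UserAgent->new( timeout => 10 );
        my $response = $ua->get($url);
        if ( $response->is_success ) {
            # ... parse/extract from $response->content here ...
        }
        else {
            warn "$url failed: ", $response->status_line, "\n";
        }

        $pm->finish;            # child exits
    }

    $pm->wait_all_children;

    One thing to keep in mind: each page is fetched and processed inside its own child, so anything you want back in the parent has to be written somewhere it can read it (a file, a database, etc.).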

    After Compline,
    Zaxo