schnibitz has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, first post, and I'm a bit of a novice at this stuff. Basically I want some code that will open a list of sites in parallel and search those pages to find something. I have the code for that, but the problem (from what I can tell) is that it works a bit too well, and yet not well enough: it spawns a new thread for each site it opens, which is cool and all, except that there can sometimes be a LOT of sites, and that bogs down the server pretty quickly. One way I could fix it is to somehow limit the number of threads that run simultaneously to 10 or 20 or so. I sort of have to work within the modules below as well: LWP::Parallel won't install as a module on my system, and LWP::UserAgent doesn't like to be used in threads, so HTTP::Lite seems like the best option at the moment. One direction I was thinking of was to implement some queuing, but I honestly don't know how to make that work with this code. Any suggestions are greatly appreciated. Here's the code:
use strict;
use warnings;
use threads;
use HTTP::Lite;

sub request {
    my ($address) = @_;    # the URL is passed in as a thread argument
    print "trying to open up $address\n", scalar(localtime), "\n";
    my $http = HTTP::Lite->new;
    defined $http->request($address)
        or die "Unable to get document: $!";
    my $content = $http->body();
    print "  Done with request for $address\n";
    my $search_variable = "test";
    my $matches = 0;
    $matches++ while $content =~ m/$search_variable/g;
    print "  Finished searching $address for $search_variable ($matches hits) ",
        scalar(localtime), "\n";
}

my @addresses = (
    'http://www.website1.com',
    'http://www.website2.com',
    .
    .
    .
    'http://www.website50.com',
);

my @threads;
foreach my $address (@addresses) {
    push @threads, threads->create( \&request, $address );
}
$_->join() for @threads;

Replies are listed 'Best First'.
Re: Threaded Web requests
by rcaputo (Chaplain) on Jan 02, 2010 at 23:05 UTC

    If you were using something like POE::Component::Client::HTTP (see these recipes), I would recommend starting a polite number of parallel requests, perhaps 10 or 20, and then firing off a new request for each response that arrives. This is a handy way to limit parallelism without a lot of bookkeeping.

    You may still be able to do this. It depends whether you can join() on "the next thread to finish". For every thread that joins, start another with the next request. Exit when you've run out of @addresses and @threads.
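    The "start a polite number, refill as each finishes" idea can also be done with a fixed worker pool and a shared queue, using only core modules (threads and Thread::Queue). This is a minimal sketch, not the code from either reply: the worker body below just counts items, standing in for where the HTTP::Lite request and search from the original request() sub would go, and the addresses are placeholder URLs.

    ```perl
    use strict;
    use warnings;
    use threads;
    use Thread::Queue;

    # Placeholder URLs; in real use this would be the original @addresses list.
    my @addresses = map { "http://www.website$_.com" } 1 .. 50;

    # Pre-load the queue with all the work, then mark it finished so
    # dequeue() returns undef once the queue is drained.
    my $queue = Thread::Queue->new(@addresses);
    $queue->end();

    my $max_workers = 10;    # cap on simultaneous threads

    sub worker {
        my $count = 0;
        while ( defined( my $address = $queue->dequeue() ) ) {
            # This is where the HTTP::Lite fetch-and-search would go,
            # called with $address as its argument.
            $count++;
        }
        return $count;    # how many addresses this worker handled
    }

    # Only $max_workers threads ever exist, no matter how long the list is.
    my @workers = map { threads->create( \&worker ) } 1 .. $max_workers;

    my $total = 0;
    $total += $_->join() for @workers;
    print "processed $total addresses with ", scalar(@workers), " threads\n";
    ```

    Because every worker pulls its next address from the same queue, a slow site only stalls one thread while the other workers keep draining the list, which is the same effect as joining on "the next thread to finish" but without tracking individual threads.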

Re: Threaded Web requests
by BrowserUk (Patriarch) on Jan 03, 2010 at 00:25 UTC
      These are great suggestions. I'm going to give them a try. Thank you very much.