gferguson has asked for the wisdom of the Perl Monks concerning the following question:

I just got a task of gathering content from 32K+ URLs from one of our customer's sites.

My first pass was quick 'n dirty, running single-threaded using LWP::UserAgent, and took over seven hours.

Now that I've got a chance to do some additional coding, I want to change to some sort of parallel gets.

For other tasks I tried using LWP::Parallel, but didn't see a huge speed-up.

For a different task I wrote some parallel gets using threads, which seemed to scale well and ran faster, but the warning when I build Perl that threads isn't ready for prime-time use always makes me nervous.

I looked at POE, but haven't built anything significant with it. It seems more than adequate for the task, but at the same time it feels like the elephant-gun approach: way too much for what should be a simple task of walking a list of URLs.

I'm leaning towards resurrecting my threads + LWP::UserAgent code because it seemed to be robust and fast(er).

What are the thoughts from the collective mind here on LWP::UserAgent plus threads vs. LWP::Parallel vs. POE? And is there something I might have missed?

Thanks.

Re: running multiple LWP gets
by zentara (Cardinal) on May 25, 2007 at 17:34 UTC
    First, search http://groups.google.com for "LWP Parallel::ForkManager".

    I don't see why LWP::Parallel wasn't showing a speed improvement, unless max_req was left too low: it defaults to 5, so try something like $ua->max_req(30).
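
    For reference, the basic LWP::Parallel::UserAgent pattern looks roughly like this (an untested sketch; the URL list is just a placeholder, and you'd read your 32K URLs from a file):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Parallel::UserAgent;
    use HTTP::Request;

    # Placeholder list.
    my @urls = qw( http://www.yahoo.com http://www.cdsllc.com/ );

    my $pua = LWP::Parallel::UserAgent->new;
    $pua->max_req(30);     # concurrent requests per host (default 5)
    $pua->max_hosts(30);   # raise the concurrent-host cap as well

    # Queue every request, then block until they all finish.
    $pua->register( HTTP::Request->new( GET => $_ ) ) for @urls;
    my $entries = $pua->wait;

    foreach my $key ( keys %$entries ) {
        my $res = $entries->{$key}->response;
        print $res->request->url, " : ", $res->code, "\n";
    }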

    I don't think threads will help you, except to give real-time progress reports for each download. Try this (there are similar examples on groups.google.com):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Parallel::ForkManager;
    use LWP::UserAgent;
    use HTTP::Request;

    my %urls = (
        'drudge' => 'http://www.drudgereport.com',
        'rush'   => 'http://www.rushlimbaugh.com/home/today.guest.html',
        'yahoo'  => 'http://www.yahoo.com',
        'cds'    => 'http://www.cdsllc.com/',
    );

    # Up to 30 download jobs in flight at once.
    my $pm    = Parallel::ForkManager->new(30);
    my $count = 0;

    foreach my $myURL ( sort values %urls ) {
        $count++;
        print "Count is $count\n";

        # Fork a child for this URL; the parent moves on to the next.
        $pm->start and next;

        print "Starting child process $count for $myURL\n";
        my $ua = LWP::UserAgent->new;
        $ua->agent( "$0/0.1 " . $ua->agent );
        my $req = HTTP::Request->new( GET => $myURL );

        # The second argument saves the response body straight to a file.
        my $res = $ua->request( $req, "$count.html" );
        print "Process $count complete\n";

        $pm->finish;
    }

    print "Waiting on children\n";
    $pm->wait_all_children;

    I'm not really a human, but I play one on earth. Cogito ergo sum a bum
      DOH! I forgot about the newsgroups... yes, there's lots of mention there, so I'll see what I can dig up.

      Thanks!

      PS - thanks for the sample code. I'll poke at it and see what it does.

        Doh! :-) As you poke, just remember that I grabbed it without much testing. Now that I look closer, it may not set up Parallel::ForkManager in a way that runs properly... Parallel::ForkManager has some odd syntax since it's an object.

        I'm not really a human, but I play one on earth. Cogito ergo sum a bum
Re: running multiple LWP gets
by Fletch (Bishop) on May 25, 2007 at 17:37 UTC

    Just a lateral-thinking kind of suggestion: ask the customer if they can give you a copy of the content directly rather than you walking their site. Aside from sidestepping the problem of what to use, that also avoids any bottlenecks you'd run into by attempting to pull content faster than their boxen could provide it.

    And POE's not an elephant-gun approach, it's just an event-based wrapper around a traditional *NIX select-y approach. It can take a little getting into the mindset, but once that's done POE's pretty cool.
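
    To give a feel for it, the whole job with POE::Component::Client::HTTP comes out roughly like this (an untested sketch; the URL list is a placeholder):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use POE qw(Component::Client::HTTP);
    use HTTP::Request;

    # Placeholder list.
    my @urls = qw( http://www.yahoo.com http://www.cdsllc.com/ );

    # One non-blocking HTTP client component, shared by everything.
    POE::Component::Client::HTTP->spawn( Alias => 'ua', Timeout => 30 );

    POE::Session->create(
        inline_states => {
            _start => sub {
                # Fire off every request; responses come back as events.
                $_[KERNEL]->post( 'ua', 'request', 'got_response',
                    HTTP::Request->new( GET => $_ ) ) for @urls;
            },
            got_response => sub {
                my ( $request_packet, $response_packet ) = @_[ ARG0, ARG1 ];
                my $response = $response_packet->[0];
                print $request_packet->[0]->uri, " : ", $response->code, "\n";
            },
        },
    );

    POE::Kernel->run();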

      Well... I did ask for content. Several times. And, as it usually goes in corporate 'Merica, some gatekeeper between me and them thought the badly mangled and truncated content I was receiving was good enough. Of course I wouldn't have been given the task if their solution had been working. :-)

      I'll look into POE and its examples some more. It seems really useful for other tasks we've got, just overkill for this particular one. My inexperience with it could be coloring my vision though.

      Thanks!

Re: running multiple LWP gets
by clueless newbie (Curate) on May 26, 2007 at 13:07 UTC
    Hi, I've used LWP::UserAgent in conjunction with threads. With multiple threads I was able to keep my CPU completely pegged while downloading all of the Page 3 wallpapers. It isn't too difficult to do.
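
    The basic shape is a handful of worker threads pulling URLs off a shared queue. A minimal, untested sketch (the URL list is a placeholder; in practice you'd read it from a file):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use threads;
    use Thread::Queue;
    use LWP::UserAgent;

    # Placeholder list.
    my @urls  = qw( http://www.yahoo.com http://www.cdsllc.com/ );
    my $queue = Thread::Queue->new(@urls);

    my @workers = map {
        threads->create( sub {
            my $ua = LWP::UserAgent->new( timeout => 30 );
            # dequeue_nb returns undef once the queue is drained.
            while ( defined( my $url = $queue->dequeue_nb ) ) {
                my $res = $ua->get($url);
                print "$url : ", $res->status_line, "\n";
            }
        } );
    } 1 .. 10;    # ten workers; tune to taste

    $_->join for @workers;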