tphyahoo has asked for the wisdom of the Perl Monks concerning the following question:

Is there any way to combine the goodness of LWP::Parallel::UserAgent with WWW::Mechanize?

I believe WWW::Mechanize is a subclass of LWP::UserAgent, in case that's relevant.

Basically, I like the idiot-friendly interface of mechanize, but would like to be downloading pages in parallel. So, is there a way?

This somewhat revisits a question I asked some time ago -- What is the fastest way to download a bunch of web pages? -- so if this is a dead end, I do have Other Ways to Do It.

I'm only asking in case I overlooked something.

Thanks for your wisdom.

Re: Combining LWP::Parallel::UserAgent with WWW::Mechanize
by jasonk (Parson) on Apr 20, 2006 at 18:02 UTC

    Combining them would be difficult due to the nature of WWW::Mechanize (if someone calls the method that says 'follow this link', which of the pages being loaded in parallel do you follow it on?), but you could build an application that combines them by using something like Parallel::ForkManager for the parallel parts and creating a WWW::Mechanize object in each forked process.
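
    A bare-bones sketch of that approach (the URL list on the command line and the cap of 10 children are placeholders, not recommendations):

        use strict;
        use warnings;
        use Parallel::ForkManager;
        use WWW::Mechanize;

        my @urls = @ARGV;                          # URLs to fetch
        my $pm   = Parallel::ForkManager->new(10); # at most 10 children at once

        for my $url (@urls) {
            $pm->start and next;   # parent: spawn a child, move to next URL

            # child: each forked process gets its own Mechanize object
            my $mech = WWW::Mechanize->new( autocheck => 0 );
            $mech->get($url);
            print "$url: ", $mech->status, "\n";

            $pm->finish;           # child exits here
        }
        $pm->wait_all_children;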


    We're not surrounded, we're in a target-rich environment!
Re: Combining LWP::Parallel::UserAgent with WWW::Mechanize
by perrin (Chancellor) on Apr 20, 2006 at 18:07 UTC
    You should consider just forking and using Mechanize from multiple processes instead. My experience with LWP::Parallel::UserAgent has been that it performs badly compared to a multi-process approach.
      Thanks Perrin.

      Apart from LWP::Parallel, which you didn't like, I have tracked down two basic approaches to parallel downloading.

      1. Thread::Queue -- Re: What is the fastest way to download a bunch of web pages? (thanks BrowserUK); a bare-bones sketch appears at the end of this reply
      2. Parallel::ForkManager (suggested by jasonk above, and also mentioned in the "fastest way to download" thread)

      Do you think one way has any advantages over the other? Or are these ways essentially the same under the hood?

      FWIW, I'm on Linux now (new job -- yay! now I get Perl in its native habitat :)), since that seems to be relevant when forking comes into play. (Forking works better on Linux.)

      Also, to give a bit more context, I'll be downloading potentially tens of thousands of websites, but no more than 100 from any one particular domain.
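
      The bare-bones Thread::Queue sketch promised above (the pool of 5 workers is arbitrary, and error handling is omitted):

          use strict;
          use warnings;
          use threads;
          use Thread::Queue;
          use WWW::Mechanize;

          my $q = Thread::Queue->new;

          # a fixed pool of worker threads, each with its own Mechanize object
          my @workers = map {
              threads->create( sub {
                  my $mech = WWW::Mechanize->new( autocheck => 0 );
                  while ( defined( my $url = $q->dequeue ) ) {
                      $mech->get($url);
                      print "$url: ", $mech->status, "\n";
                  }
              } );
          } 1 .. 5;

          $q->enqueue($_)    for @ARGV;     # feed the queue
          $q->enqueue(undef) for @workers;  # one undef per worker means "done"
          $_->join           for @workers;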

        I don't use threads, but I do know that memory consumption tends to be higher with threads than with an equivalent number of processes. I use Parallel::ForkManager, and it works well and reliably. Collecting the data can be more work than with threads, though.
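
        For what it's worth, newer versions of Parallel::ForkManager (0.7.6 and later) can pass a reference back from each child via finish(), which takes some of the sting out of collecting the data. A sketch, again with placeholder URLs on the command line:

            use strict;
            use warnings;
            use Parallel::ForkManager;
            use WWW::Mechanize;

            my $pm = Parallel::ForkManager->new(10);

            my %results;
            $pm->run_on_finish( sub {
                my ( $pid, $exit, $ident, $signal, $core, $data ) = @_;
                # $data is whatever reference the child handed to finish()
                $results{ $data->{url} } = $data->{status} if $data;
            } );

            for my $url (@ARGV) {
                $pm->start and next;
                my $mech = WWW::Mechanize->new( autocheck => 0 );
                $mech->get($url);
                $pm->finish( 0, { url => $url, status => $mech->status } );
            }
            $pm->wait_all_children;

            print "$_ => $results{$_}\n" for sort keys %results;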