shaolin_gungfu has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I've built a web robot using the LWP::RobotUA module from CPAN. It works pretty well, although I've had a few problems, like requests occasionally hanging for ages without ever timing out, but that's not my question here.

The thing is, it works sequentially, which is a bit slow: I can only visit a few hundred URLs an hour. I've heard about a parallel version of the module (LWP::Parallel::RobotUA), and I was wondering if anyone could tell me anything about it. Will it be much better? What kind of performance increase will I see? Will it be worth the effort?
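
From skimming the LWP::Parallel documentation, I think basic usage would be roughly along these lines, though I haven't tried it yet, so the calls below are just my reading of the docs rather than working code (the agent name, addresses and URLs are placeholders):

use LWP::Parallel::RobotUA;
use HTTP::Request;

# untested sketch - agent name, contact address and URLs are placeholders
my $ua = LWP::Parallel::RobotUA->new('mybot/0.1', 'me@example.com');
$ua->timeout(30);   # seconds before giving up on a connection
$ua->delay(0.5);    # minutes to wait between requests to the same host

my @urls = ('http://www.example.com/', 'http://www.example.org/');

# register all the requests up front instead of fetching one at a time
foreach my $url (@urls) {
    $ua->register(HTTP::Request->new(GET => $url));
}

# let the requests run in parallel, then walk through the responses
my $entries = $ua->wait();
foreach my $key (keys %$entries) {
    my $res = $entries->{$key}->response;
    print $res->request->uri, ' => ', $res->code, "\n";
}

Is that roughly the right shape, and is it likely to be noticeably faster in practice than the sequential version?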

Also, I'm using ActiveState Perl on Windows ME, so I normally use their Perl Package Manager (PPM) to install modules, but it isn't working at the moment. Can I just get the modules from CPAN instead, and if so, how do I install them in my version of Perl?
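
In case it matters, what I was thinking of trying is the standard CPAN shell from the command prompt, something like the following, but I'm not sure whether it needs a compiler or whether it plays nicely with the ActiveState build:

C:\> perl -MCPAN -e shell
cpan> install LWP::Parallel::RobotUA

Is that the right approach, or should I wait until PPM is working again?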

Thanks,
Tom

Replies are listed 'Best First'.
Re: Building a Parallel Robot
by abstracts (Hermit) on Apr 29, 2002 at 23:14 UTC
    Hello

    You say:

    "... that's a bit slow really, I can only visit about a few hundred URLs an hour."

    What is your connection speed? How big is the average page you fetch? If you're on a 56 kbit/s modem and the average page is 56 kbytes (including headers and other overhead), then you can fetch one URL every 8 seconds, or about 450 URLs/hour, and that shouldn't be considered slow.

    Now, if you (like me) can get 5 Mbit/s, then that would be another story.

Re: Building a Parallel Robot
by Molt (Chaplain) on Apr 30, 2002 at 11:58 UTC

    Don't forget that spidering a site too rapidly is a good way to stop other people getting to it, and thus to really annoy the webmaster.

    If you are going to try some kind of parallel robot, make sure you're not hammering all of their bandwidth. Sometimes it pays to do these things a little slower; it's more polite.

    If you're on dialup then it's not overly important, but I have been on the receiving end of a very high-bandwidth client spidering us, and I can assure you I wasn't saying 'Oh, how nice, they're showing their interest'. I was phoning a colleague to modify the firewall rules and stop them eating any more of our bandwidth before it choked everything.

    Yes, I know this server could have been set up better, but then so could most of them out there on the Internet.

    I've not really used the Perl robot stuff in anger yet, although that may change soon. I do know that most spidering tools have 'wait period' and 'maximum bandwidth' settings, and I'd heavily recommend using them.
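
    For what it's worth, with plain LWP::RobotUA the knobs I'd expect to reach for look something like this. It's only a sketch: the agent name, contact address and URL are placeholders, and the delay and timeout values are just examples.

        use LWP::RobotUA;
        use HTTP::Request;

        # placeholder agent name and contact address - use your own
        my $ua = LWP::RobotUA->new('mybot/0.1', 'webmaster@example.com');
        $ua->delay(1);       # wait at least a minute between requests to the same host
        $ua->timeout(30);    # and don't sit on a dead connection forever

        my $res = $ua->request(HTTP::Request->new(GET => 'http://www.example.com/'));
        print $res->status_line, "\n";

    Whatever parallel module you end up with, I'd check that it honours a similar delay setting before turning it loose on anyone's server.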

Re: Building a Parallel Robot
by tomhukins (Curate) on Apr 29, 2002 at 21:47 UTC

    Before asking a question here, take a look through the old discussions; there's a lot of useful material there. If you type 'parallel' into the search box at the top of this page, you'll find plenty of prior discussion of this topic. If you still have specific queries after reading through the old posts, just ask.

    As for your PPM problem, in what way is it not working? In my experience, most PPM problems are caused by firewalling or proxy misconfiguration.
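
    If a proxy is involved, it's worth checking that PPM can actually see it. As far as I remember, PPM picks the proxy up from the HTTP_proxy environment variables, so setting something like the following at the command prompt (with your own proxy host, port and credentials, obviously) before running ppm may be all it takes:

        set HTTP_proxy=http://proxy.example.com:8080
        set HTTP_proxy_user=your_username
        set HTTP_proxy_pass=your_password

    If there's no proxy in the picture, check whether a personal firewall is blocking ppm from reaching the ActiveState repository.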

      Oops, sorry, I should have checked the old material first; that's a novice mistake and I apologise. Thanks for the advice anyway. :-)