in reply to spidering, multi-threading and netiquette

First of all, if you want to reduce the load you place on the server's bandwidth, an easy way to do that is to use persistent connections. Unfortunately, unless some LWP development has happened that I'm not aware of, LWP doesn't support them. Fortunately, some other libraries (libwhisker, for example) do.

That said, although RFC 2616 is talking about persistent connections in this paragraph, I'd take it to heart even with non-persistent connections:

Clients that use persistent connections SHOULD limit the number of simultaneous connections that they maintain to a given server. A single-user client SHOULD NOT maintain more than 2 connections with any server or proxy. A proxy SHOULD use up to 2*N connections to another server or proxy, where N is the number of simultaneously active users. These guidelines are intended to improve HTTP response times and avoid congestion.
Finally, I'm a bit wary of the line "I estimate that the robot will send about one request per second." If that's your estimate, have some mechanism in place so that when it goes above 90 requests/minute the script is killed. I've seen far too many programs go wrong with a simple misplaced comma to trust that some program I write won't suddenly go wild without doing some testing first.

The simplest way to do this is to log all requests to the screen and be fast with the Ctrl-C when things go bad.
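If you want something more automatic than a fast Ctrl-C, here is a minimal sketch of the kill-switch idea, assuming the 90-requests-per-minute ceiling mentioned above. The guard_request name and the sliding-window bookkeeping are illustrative, not from any library:

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Safety valve: remember the timestamp of each request and die if more
# than $limit requests happen within any $window-second span.
my @recent;            # timestamps of recent requests
my $limit  = 90;       # max requests allowed per window (assumption)
my $window = 60;       # window size in seconds

sub guard_request {
    my $now = time();
    push @recent, $now;
    # drop timestamps that have fallen outside the window
    @recent = grep { $now - $_ <= $window } @recent;
    die "Rate limit exceeded: " . scalar(@recent) . " requests in ${window}s\n"
        if @recent > $limit;
}
```

Call guard_request() immediately before each HTTP request; if a misplaced comma ever sends the script into a tight loop, it kills itself instead of hammering the server.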

Replies are listed 'Best First'.
Re: Re: parallel downloading
by flyingmoose (Priest) on Feb 21, 2004 at 17:35 UTC
    The simplest way to do this is to log all requests to the screen and be fast with the Ctrl-C when things go bad.
    To the OP -- If it has any chance of going out of control, there should be sleep instructions embedded in the code to reduce load. During debug, these intervals should be fairly long (0.5 - 1 second between requests?). Once you learn the script is well-behaved, you may be able to shorten them somewhat. As the bot writer, you have the utmost responsibility to limit your scans to the bare minimum possible. Not only does bandwidth cost money, but you could be slowing down access for other users. Also, if you are a simple spider, don't do something evil like run it continuously -- run it on a crontab (with a long interval) or manually.

    For sleeping between requests, check out Time::HiRes
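A minimal sketch of that, assuming the 0.5-second pause suggested above (polite_fetch_all is a made-up name, and fetch() is a placeholder for whatever request code you already have):

```perl
use strict;
use warnings;
use Time::HiRes qw(usleep);

# Polite fetch loop: pause half a second between requests.
sub polite_fetch_all {
    my @urls = @_;
    for my $url (@urls) {
        # fetch($url);     # your actual request goes here
        usleep(500_000);   # 500,000 microseconds = 0.5 s between requests
    }
}

polite_fetch_all('http://example.com/a', 'http://example.com/b');
```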

    As to the multithread question, this should be entirely up to the site admin. If he says no, don't spider it at all. If this were my site, I'd consider a multithreaded spider quite abusive, since it would be doing things normal web browsers would not do.

      The best way to ensure you have reasonable delays between your requests is to use a user agent that enforces those delays, e.g. LWP::RobotUA and LWP::Parallel::RobotUA.
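      For example, a sketch with LWP::RobotUA; note that delay() is specified in minutes, not seconds (the agent name and email address are placeholders):

```perl
use strict;
use warnings;
use LWP::RobotUA;

# LWP::RobotUA honors robots.txt and waits between requests to the
# same host, so the politeness is enforced by the user agent itself.
my $ua = LWP::RobotUA->new('my-spider/0.1', 'me@example.com');
$ua->delay(1/60);   # delay is in MINUTES; 1/60 minute = 1 second

# my $response = $ua->get('http://example.com/');  # fetched politely
```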

      You should also keep in mind that just because the server doesn't have a robots.txt today doesn't mean it won't have one tomorrow ... so make sure your code checks for it each time it's run: WWW::RobotRules.
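      A sketch of that check with WWW::RobotRules; the robots.txt content is inlined here for illustration, but in real code you would fetch it from the server on every run:

```perl
use strict;
use warnings;
use WWW::RobotRules;

my $rules = WWW::RobotRules->new('my-spider/0.1');  # placeholder agent name

my $robots_url = 'http://example.com/robots.txt';
my $robots_txt = <<'EOT';   # in real code, fetch this fresh each run
User-agent: *
Disallow: /private/
EOT

$rules->parse($robots_url, $robots_txt);

# Consult the rules before every request:
print "allowed\n" if $rules->allowed('http://example.com/index.html');
print "blocked\n" unless $rules->allowed('http://example.com/private/x');
```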