See Parallel::ForkManager for a convenient way to spawn off new processes and to limit the number of active ones.
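A minimal sketch of what that looks like (the `@urls` list and `fetch()` routine are hypothetical stand-ins for your own crawl logic):

```perl
use strict;
use warnings;
use Parallel::ForkManager;

# Cap the number of simultaneous child processes at 5.
my $pm = Parallel::ForkManager->new(5);

my @urls = ('http://example.com/a', 'http://example.com/b');  # hypothetical

for my $url (@urls) {
    $pm->start and next;   # parent gets child's pid (true) and loops on;
                           # the child gets 0 and falls through
    fetch($url);           # hypothetical: do the actual download here
    $pm->finish;           # child exits; a slot opens up
}
$pm->wait_all_children;    # block until every child is done
```

`start` blocks when all five slots are busy, so the cap is enforced for you.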
You probably want to extract the host or domain from each link in support of your commendable desire to rate-limit your requests. URI.pm can do that for you; it's a prerequisite of LWP, so you almost certainly have it installed already.
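For instance (assuming URI is installed, which it will be anywhere LWP is):

```perl
use strict;
use warnings;
use URI;

my $uri = URI->new('http://www.example.com/path/page.html?q=1');

print $uri->host, "\n";     # www.example.com -- key your rate-limit on this
print $uri->scheme, "\n";   # http
```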
If you find external links in a document, a (perhaps partial) breadth-first traversal strategy will give you something to do while waiting to hit the current domain again. Don't forget about robots.txt; you ought to plan to honor it.
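The breadth-first part is just a FIFO queue plus a seen-hash. A self-contained sketch, with a hard-coded `%links` map standing in for actually fetching and parsing pages:

```perl
use strict;
use warnings;

# %links stands in for fetching; a real crawler would download
# and parse each page to discover its outbound links.
my %links = (
    'http://a.example/'  => ['http://a.example/1', 'http://b.example/'],
    'http://a.example/1' => ['http://a.example/'],
    'http://b.example/'  => [],
);

my @queue = ('http://a.example/');   # seed URL
my %seen;
my @order;

while (defined(my $url = shift @queue)) {  # shift from the front: FIFO = breadth-first
    next if $seen{$url}++;                 # skip anything already visited
    push @order, $url;
    push @queue, @{ $links{$url} || [] };  # enqueue discovered links at the tail
}

print join(' ', @order), "\n";
```

For the robots.txt side, WWW::RobotRules (part of the libwww-perl family) parses the file and answers `allowed($url)` queries, so you can filter the queue before fetching.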
A CPAN search for Robot or Parallel turns up LWP::Parallel::RobotUA and several other candidates to help with this.
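From memory of its interface (check the module's own docs before relying on this), LWP::Parallel::RobotUA combines the parallel registration/wait model of LWP::Parallel::UserAgent with the robots.txt handling and per-host delay of LWP::RobotUA, roughly like so:

```perl
use strict;
use warnings;
use LWP::Parallel::RobotUA;
use HTTP::Request;

# Agent name and contact address, as LWP::RobotUA requires.
my $ua = LWP::Parallel::RobotUA->new('MyCrawler/1.0', 'me@example.com');
$ua->delay(1/60);   # delay between requests to the same host, in minutes

my @urls = ('http://www.example.com/');   # hypothetical seed list
$ua->register(HTTP::Request->new(GET => $_)) for @urls;

my $entries = $ua->wait;   # block until all registered requests complete
for my $entry (values %$entries) {
    my $res = $entry->response;
    print $res->code, ' ', $res->request->uri, "\n";
}
```

It fetches and caches each host's robots.txt for you, which takes care of the honoring mentioned above.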
After Compline,
Zaxo
In reply to Re: Creating a web crawler (theory)
by Zaxo
in thread Creating a web crawler (theory)
by Anonymous Monk