If you don't mind a non-Perl solution, you might consider Heritrix. It is capable of large-scale crawling and is kind to the hosts it visits: if a host takes n seconds to respond, Heritrix waits m*n seconds before hitting that host again (m is configurable but defaults to 5). Its crawl settings are also extremely flexible.
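
To make the politeness rule concrete, here is a minimal Python sketch of the same idea: remember how long each host took to respond and hold off for m times that long before contacting it again. This is not Heritrix's code or API; the class name PoliteScheduler, its methods, and the delay_factor parameter are all hypothetical names made up for illustration.

```python
from urllib.parse import urlsplit


class PoliteScheduler:
    """Tracks, per host, the earliest time the next request may be sent.

    Hypothetical illustration of an adaptive politeness delay; not Heritrix code.
    """

    def __init__(self, delay_factor=5.0):
        self.delay_factor = delay_factor  # the "m" from the text
        self.next_allowed = {}            # host -> earliest allowed timestamp

    def record_fetch(self, url, started, finished):
        """Note a completed fetch and schedule the host's next allowed hit."""
        host = urlsplit(url).hostname
        duration = finished - started     # the "n" from the text
        self.next_allowed[host] = finished + self.delay_factor * duration

    def wait_time(self, url, now):
        """Return how many seconds to wait before fetching url (0 if none)."""
        host = urlsplit(url).hostname
        return max(0.0, self.next_allowed.get(host, 0.0) - now)
```

In a real crawler you would pass timestamps from time.monotonic() and sleep for wait_time() (or requeue the URL) before each request. The delay is tracked per host, so a slow server slows down only its own queue, not the whole crawl.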