in reply to catching failures and retrying
Most spiders are built around an array used as a queue. You push the first (set of) URLs onto the queue and then start the spider: it pulls a URL off the queue, fetches it, extracts any links, pushes them onto the queue, and loops.
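A minimal sketch of that loop, assuming LWP::UserAgent for the fetching and a hypothetical extract_links() that parses the hrefs out of a page (e.g. built on HTML::LinkExtor):

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua   = LWP::UserAgent->new( timeout => 30 );
    my @urls = ('http://example.com/');    # seed the queue

    while ( my $url = shift @urls ) {
        my $response = $ua->get($url);
        next unless $response->is_success;
        # extract_links() is a stand-in for your own link parser
        push @urls, extract_links( $response->decoded_content );
    }

(A real spider would also remember which URLs it has already seen, but that's a separate concern.)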
To support retries, instead of pushing just the URL, push a URL/count pair, either as an anon array:
push @urls, [ $tries, $url ];
Or you could concatenate them into a string if holding lots of 2-element arrays proves to be a memory problem.
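For the string variant, a space makes a safe separator, since a live URL can't contain a literal space; split the pair back apart when you dequeue:

    push @urls, "$tries $url";
    # ... and on the way out:
    my ( $tries, $url ) = split ' ', shift(@urls), 2;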
Preset $tries to 3 or 5 or whatever, and each time a fetch fails, decrement the count and push the pair back on the queue if it hasn't reached 0 yet. When it reaches 0, give up and report the failure.
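Putting it together, a sketch of the retry-aware loop, reusing $ua and the hypothetical extract_links() from above:

    my $MAX_TRIES = 3;
    push @urls, [ $MAX_TRIES, 'http://example.com/' ];

    while ( my $item = shift @urls ) {
        my ( $tries, $url ) = @$item;
        my $response = $ua->get($url);
        if ( $response->is_success ) {
            # fresh links each start with a full retry allowance
            push @urls, map { [ $MAX_TRIES, $_ ] }
                        extract_links( $response->decoded_content );
        }
        elsif ( --$tries > 0 ) {
            push @urls, [ $tries, $url ];    # requeue with one try used up
        }
        else {
            warn "giving up on $url: ", $response->status_line, "\n";
        }
    }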