in reply to catching failures and retrying

Most spiders are built around an array used as a queue. You push the first URL (or set of URLs) onto the queue and start the spider: it pulls a URL off the queue, fetches it, extracts any links, pushes those onto the queue, and loops.
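
In outline, something like this (fetch() and extract_links() are placeholders for whatever HTTP client and link parser you're using, e.g. LWP::UserAgent and HTML::LinkExtor):

    my @queue = @start_urls;
    while ( my $url = shift @queue ) {
        my $page = fetch( $url ) or next;       # placeholder HTTP fetch
        push @queue, extract_links( $page );    # placeholder link parser
    }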

To support retries, instead of pushing just the URL, push a URL/count pair, e.g. as an anonymous array:

    push @urls, [ $tries, $url ];

Or, if lots of small two-element arrays prove to be a memory problem, you could concatenate the pair into a single string.
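
For instance (the space separator is arbitrary; it's safe here because a URL contains no literal spaces):

    push @urls, "$tries $url";    # pack the pair into one scalar

    # ... and when dequeuing:
    my ( $tries, $url ) = split ' ', shift( @urls ), 2;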

Preset $tries to 3 or 5 or whatever; each time a fetch fails, decrement the count and, if it hasn't reached 0 yet, push the pair back onto the queue. When it reaches 0, give up and report the failure.
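
Putting it together, here's a minimal sketch using LWP::UserAgent. The names ($MAX_TRIES, extract_links) are mine, and the regex-based extract_links() is a naive stand-in; real code would want HTML::LinkExtor or similar, plus a %seen hash so you don't re-fetch the same URL:

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $MAX_TRIES = 3;
    my $ua = LWP::UserAgent->new( timeout => 30 );

    # Seed the queue with [ tries-remaining, url ] pairs.
    my @queue = map { [ $MAX_TRIES, $_ ] } @ARGV;

    while ( my $item = shift @queue ) {
        my ( $tries, $url ) = @$item;
        my $res = $ua->get( $url );

        if ( $res->is_success ) {
            # Each newly discovered link gets a fresh retry count.
            push @queue, map { [ $MAX_TRIES, $_ ] }
                extract_links( $res->decoded_content );
        }
        elsif ( --$tries > 0 ) {
            # Failed: requeue with one fewer try remaining.
            push @queue, [ $tries, $url ];
        }
        else {
            warn "Giving up on $url: ", $res->status_line, "\n";
        }
    }

    sub extract_links {
        # Naive href extraction; use a real HTML parser in production.
        return $_[0] =~ m{href\s*=\s*["']([^"']+)["']}gi;
    }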

