in reply to Verifying external web links

As already suggested, LWP is your solution.

However, I will point out that any solution should not be 'try once and fail', but rather '3 strikes and then fail'. Given the state of internet connectivity today, most major commercial sites are up 99.9+% of the time, but many offbeat sites will sometimes be inaccessible because of a lower-grade ISP (e.g. a site served over residential broadband). Such a site might not be up at the moment you try it, but 2 minutes, 2 hours, or 2 days later it may be. The best way to do link checking is to test a site and, if it's not there, try it again the next day, then the next week, and then possibly the week after that, ideally at sufficiently different times of day (midnight, 6 a.m., noon, 6 p.m.). This should cover things like DNS resolution issues, network outages, and equipment replacements. If a site fails all 3 or 4 times, then it's probably gone.
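
To make the schedule concrete, here is a minimal sketch of the timing logic only (no fetching), written in Python for brevity; the same arithmetic would sit around whatever LWP call you use. The names `RETRY_OFFSETS_DAYS`, `RETRY_HOURS`, `retry_times`, and `is_probably_gone` are made up for illustration, and the exact offsets (1, 7, and 14 days) are just one reading of "next day, next week, the week after".

```python
from datetime import datetime, timedelta

# Assumed schedule: days after the first failed check at which to retry.
RETRY_OFFSETS_DAYS = [1, 7, 14]
# Times of day to rotate through: midnight, 6 a.m., noon, 6 p.m.
RETRY_HOURS = [0, 6, 12, 18]
# The first check plus one retry per offset.
MAX_FAILURES = 1 + len(RETRY_OFFSETS_DAYS)


def retry_times(first_failure):
    """Return the datetimes at which a failed link should be rechecked."""
    # Which 6-hour slot of the day the first failure fell into (0..3).
    first_slot = first_failure.hour // 6
    times = []
    for attempt, days in enumerate(RETRY_OFFSETS_DAYS, start=1):
        # Move one slot further round the clock on every retry.
        hour = RETRY_HOURS[(first_slot + attempt) % len(RETRY_HOURS)]
        day = first_failure.date() + timedelta(days=days)
        times.append(datetime(day.year, day.month, day.day, hour))
    return times


def is_probably_gone(consecutive_failures):
    """True once the link has failed the first check and every retry."""
    return consecutive_failures >= MAX_FAILURES
```

For a first failure on a mid-afternoon check, this gives retries at 6 p.m. the next day, midnight a week out, and 6 a.m. two weeks out, so each attempt lands in a different part of the day.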

-----------------------------------------------------
Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain
"I can see my house from here!"
It's not what you know, but knowing how to find it if you don't know that's important

Re: Re: Verifying external web links
by Fastolfe (Vicar) on Dec 06, 2001 at 01:03 UTC

    Whether or not you retry should depend on the nature of the failure. If you get back a 400- or 500-series response, you should generally stop there, since the server has pretty much stated, "No way, no how." Possible exceptions are a 408 (Request Timeout) response and, arguably, a 500, since those errors may well be temporary.

    If, on the other hand, the request fails due to a connection problem like the ones you discuss (connection refused, timed out, no route to host), I might wait a bit (hours? days?) and try again.
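
    The rule above could be sketched roughly like this. It is Python rather than Perl and only covers the decision, not the request itself. `classify` and `RETRYABLE_STATUSES` are hypothetical names, and it assumes the HTTP client has already followed any redirects, so anything outside the 400 and 500 series counts as fine.

```python
# Statuses treated as possibly temporary: 408 Request Timeout and,
# arguably, 500 Internal Server Error.
RETRYABLE_STATUSES = {408, 500}


def classify(status_code=None, connection_failed=False):
    """Return 'ok', 'retry', or 'give_up' for one check of one link."""
    if connection_failed:
        # Connection refused, timed out, no route to host: try again later.
        return "retry"
    if status_code is None:
        raise ValueError("need a status code or connection_failed=True")
    if status_code in RETRYABLE_STATUSES:
        return "retry"
    if 400 <= status_code <= 599:
        # The server answered and said no; stop here.
        return "give_up"
    return "ok"
```

    Whether 404 belongs in `RETRYABLE_STATUSES` is a judgment call; adding it there is a one-line change.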

      I'd argue that 404s should be rechecked too, though a site that starts off with a 404 is more likely to end up off the list than one that returns a 408 or 500 or has connection problems. Sometimes, if you've linked 'deep' into a site (anywhere off the front page, or into a user's account), the server's storage might be getting rearranged, so for a short window you'd get 404s, while outside that window the page would be accessible normally. There are other reasons I can think of as well, uncommon but not unlikely, which is why I'd check pages repeatedly regardless of the error.

      That said, it certainly would not be too hard for such a tool to report in a log file why each link was removed, allowing the person to chase down those that might be recoverable (commonly 404s), as opposed to those that are probably lost for good (no connection over several attempts).
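
      A log entry of that kind might look something like the sketch below (Python, formatting only). The `Removal` record, the reason strings `"404"` and `"connection"`, and the tab-separated layout are all assumptions made for the example, not an established format.

```python
from dataclasses import dataclass


@dataclass
class Removal:
    url: str
    reason: str    # assumed values: "404", "connection", or anything else
    attempts: int  # how many checks failed before the link was dropped


def removal_log_line(removal):
    """Format one tab-separated log line explaining why a link was removed."""
    if removal.reason == "404":
        hint = "possibly recoverable"
    elif removal.reason == "connection":
        hint = "probably lost"
    else:
        hint = "needs review"
    return "\t".join(
        [removal.url, removal.reason, f"attempts={removal.attempts}", hint]
    )
```

      Grepping the resulting file for "possibly recoverable" then gives the short list worth chasing down by hand.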

      -----------------------------------------------------
      Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain
      "I can see my house from here!"
      It's not what you know, but knowing how to find it if you don't know that's important