My company is relocating some servers today, and one of my coworkers asked if I knew of a link spider he could use to quickly check for (4|5)xx error codes and make sure things are working. Rather than spend minutes searching the web for one, I reinvented the wheel in 30 seconds. In fact, it's taken me longer to put together this node!
#!/usr/bin/perl
use strict;
use LWP::UserAgent;

my @list = qw(
    http://www.yahoo.com
    http://w1.dev.chaffee.com
    http://foo.bar
);

my $ua = LWP::UserAgent->new();
$ua->timeout(10);

for (@list) {
    my $r = $ua->get($_);
    print "Try $_\n";
    # anything in the 4xx/5xx range means something is broken
    if ( $r->code =~ /^[45]/ ) {
        print "\n\nError with $_: " . $r->code . ": " . $r->message . $/ . $/;
    }
}
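(Side note: the HTTP::Response object that LWP hands back also has an is_error method that is true for exactly the 4xx/5xx range, so the regex isn't strictly needed. A roughly equivalent, untested sketch with the same URL list:)

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

my @list = qw( http://www.yahoo.com http://w1.dev.chaffee.com http://foo.bar );
my $ua   = LWP::UserAgent->new( timeout => 10 );

for my $url (@list) {
    print "Try $url\n";
    my $r = $ua->get($url);
    # is_error is true for any 4xx or 5xx status code
    print "Error with $url: ", $r->status_line, "\n" if $r->is_error;
}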

Re: Reinventing the wheel
by jhourcle (Prior) on Mar 20, 2005 at 05:23 UTC

    I'd like to thank you for sharing your code, but your program isn't so much a 'spider' as a 'monitoring' tool. That is, it doesn't recursively go through the pages to find other links to follow -- it only fetches a single page (or a limited list of single pages), and checks that the page in question isn't generating an error.
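    To make the distinction concrete, here is a rough, untested sketch of what a minimal single-host spider might look like in the same LWP style (the starting URL is just a placeholder):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::LinkExtor;
    use URI;

    my $start = 'http://w1.dev.chaffee.com/';    # placeholder starting point
    my $host  = URI->new($start)->host;

    my $ua = LWP::UserAgent->new( timeout => 10 );
    my %seen;
    my @queue = ($start);

    while ( my $url = shift @queue ) {
        next if $seen{$url}++;
        my $r = $ua->get($url);
        print $r->code, ' ', $url, "\n";
        next unless $r->is_success and $r->content_type eq 'text/html';

        # pull href/src attributes out of the page; the base URL absolutizes them
        my $p = HTML::LinkExtor->new( undef, $r->base );
        $p->parse( $r->decoded_content );
        for my $link ( $p->links ) {
            my ( $tag, %attr ) = @$link;
            for my $u ( values %attr ) {
                my $abs = URI->new($u);
                next unless $abs->scheme and $abs->scheme =~ /^https?$/;
                $abs->fragment(undef);            # ignore #fragment-only differences
                push @queue, $abs->as_string if $abs->host eq $host;   # stay on one host
            }
        }
    }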

    There are plenty of free "link checkers" available, such as the one from the W3C, and Linklint (written in Perl, and open source).

    If you're going to use something for monitoring, you might want to verify that the page is the same as a known good copy (or falls within some tolerance of a good copy, if you are monitoring a dynamic page), as there are many things that can go wrong without generating an error. (e.g., being served the default Apache 'new server' page would still return a 200.)
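    For example, a known-good check can be as simple as comparing a checksum of the fetched page against one recorded beforehand. A rough, untested sketch -- the URL-to-digest table here is entirely made up:

    use strict;
    use warnings;
    use LWP::UserAgent;
    use Digest::MD5 qw(md5_hex);

    # hypothetical table of URL => MD5 of a known good copy of that page
    my %known_good = (
        'http://w1.dev.chaffee.com/' => 'ffffffffffffffffffffffffffffffff',   # placeholder digest
    );

    my $ua = LWP::UserAgent->new( timeout => 10 );
    while ( my ( $url, $digest ) = each %known_good ) {
        my $r = $ua->get($url);
        if ( !$r->is_success ) {
            print "Error with $url: ", $r->status_line, "\n";
        }
        elsif ( md5_hex( $r->content ) ne $digest ) {
            # a 200, but not the page we expected (e.g. the default server page)
            print "Content mismatch on $url\n";
        }
    }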

      You seem not to have read my node. I stated that the goal was to quickly check a list of URLs for 4xx/5xx errors. Not to completely spider a site, not to check one at a time, not to examine all links on a page for validity, and definitely not to frequently monitor and report on availability. If you're going to comment negatively, then comment on topic -- say my algorithm sucks, or that there's a better way to check that would take less than a minute to code. My point was that I wrote this in less time than it would take to download and untar linklint, read the instructions, and get on to using it.

        I apologize, but your original post stated that you were looking for a link spider, and you did not state that such behaviour would be overkill for the problem you were trying to solve. You instead presented an alternative solution which, as I understood it, did not fit the role of a spider, because it did not parse the page and follow subsequent links. This may support the Reading the same text and getting a different impression thread.

        Personally, I have done webserver support for many years (since 1995 ... my first server migration was so we would have support for software virtual servers and SSL), and keep a number of tools on hand for testing. For quick tests, I typically either just bring them up in a web browser:

        $ netscape "url_goes_here"

        or for times when the hosts aren't in DNS yet:

        $ telnet server_ip_address 80
        GET /url_path HTTP/1.0
        Host: url_hostname
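        (The same trick can be done from Perl if you want to fold it into a script like the one above: request the page by IP address and set the Host header yourself. A rough, untested sketch -- the IP and hostname are placeholders:)

        use strict;
        use warnings;
        use LWP::UserAgent;
        use HTTP::Request;

        my $ip   = '192.0.2.10';         # placeholder server address
        my $host = 'url_hostname';       # placeholder virtual host name

        my $req = HTTP::Request->new( GET => "http://$ip/url_path" );
        $req->header( Host => $host );   # ask for the right software virtual server

        my $r = LWP::UserAgent->new( timeout => 10 )->request($req);
        print $r->status_line, "\n";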

        As another alternative, I have the respective content owners check their websites, while I keep an eye on the webserver's error log:

        $ tail -f /path/to/webserver/error_log

        You could also run wget against the site to begin the spidering, while you watch the logs, if you don't have linklint.
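        Something along these lines, roughly (flags from memory; --spider checks the links without saving anything, and -r makes it recurse):

        $ wget -r --spider -nv -o spider.log http://url_hostname/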

        I have done a number of large scale server migrations, and the only time we ever had a problem with the data migration that wasn't caught in our migration testing, it was because we did not sample a large enough number of the pages. (One of the shell scripts that gathered all of the files modified since the last tape backup generated too long a list in one of the user's directories, which resulted in too long an argument list being sent to tar, which failed silently.) And of course, the files that were missed were from a message board for a distance learning program ... so I spent the next two days consolidating the posts between the two servers.

        Your original message also stated that the time savings were over finding a suitable program, and did not mention that you had a slow link or were concerned with the time it takes to get up to speed with the input parameters (although a basic test with linklint is very simple). I admit that searching Google for 'link spider' pulls up crap, but this is one of those times where Yahoo does well. (Okay, not on the original search, but it recommends 'linkspider', which has useful info.) Also, the search terms 'link checker' and 'link validator' both return useful results from Google.

        I apologize if you took offense at my first reply; I intended it to be constructive, to point you towards other tools that might be useful should you perform a similar migration again, and to note that your code sample didn't act as a link spider, which is the role I had interpreted from your message that you intended it to serve.

        Update: I suck at spelling.