rmckillen has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am having a little problem with LWP. This script checks links in a MySQL database. Everything works OK, but while testing I noticed an issue: when it checks a URL that no longer exists, but the server it formerly resided on serves custom 404 error pages, the link is reported as valid. I guess what I want is to consider the URL valid only if it returns a 200. How can I modify the script to work like this?
#!/usr/bin/perl
use LWP::UserAgent;
use DBI;

$db_database = "db";
$db_uid = "user";
$db_pwd = "pass";

($ua = LWP::UserAgent->new)->timeout(20); #actually set timeout

$dbh = DBI->connect ("DBI:mysql:$db_database".$mysqlsock, $db_uid, $db_pwd) or die("could not connect to db\n");
$sth = $dbh->prepare("SELECT url FROM files");
$sth -> execute();
$numrows = $sth->rows;
$i = 0;
$works = 0;
$notworks = 0;
print "\n\n";

#while (my $url = $sth->fetchrow_array) {
while (defined(my $url = $sth->fetchrow_array)) {
    if(($ua->request(HTTP::Request->new('HEAD', $url)))->is_success()) {
        $validity = "link works";
        $valid_update = $dbh->do("UPDATE files SET valid = 1 WHERE url = '$url'");
        ++$works
    } else {
        $validity = "link sucks";
        $valid_update = $dbh->do("UPDATE files SET valid = valid + 1 WHERE url = '$url'");
        ++$notworks;
    }
    ++$i;
    print "$i of $numrows\n$validity\n$url\n\n";
}
$sth->finish;

Re: help with link checking
by arturo (Vicar) on Mar 06, 2001 at 23:21 UTC

    From the LWP documentation :

    The libwww-perl response object has the class name
           `HTTP::Response'.  The main attributes of objects of this
           class are:
    
           ·  The code is a numerical value that indicates the
              overall outcome of the request.
    

    so, what you want to do is change

    if(($ua->request(HTTP::Request->new('HEAD', $url)))->is_success())
    and following lines to :
    my $response = $ua->request(HTTP::Request->new('HEAD', $url));
    if ($response->code == 200) {
        # url is valid
    }
    else {
        # there's something odd with the URL
    }

    Philosophy can be made out of anything. Or less -- Jerry A. Fodor

      One minor caveat with using HEAD: some off-brand web servers don't give a 200 response if you try to HEAD a CGI.
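One way to live with that caveat is to retry with GET whenever HEAD does not come back as 200. The following is only a minimal sketch under that assumption: the helper name `code_with_get_fallback` is hypothetical and is not part of LWP or of the original script.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# Hypothetical helper (not part of LWP): try HEAD first, and retry
# with GET if HEAD does not return a 200.
sub code_with_get_fallback {
    my ($ua, $url) = @_;

    my $response = $ua->request(HTTP::Request->new(HEAD => $url));
    if ($response->code != 200) {
        # Some servers mishandle HEAD (for example on CGI URLs),
        # so ask again with GET before calling the link bad.
        $response = $ua->request(HTTP::Request->new(GET => $url));
    }

    # Numeric status code of the last response received.
    return $response->code;
}
```

The trade-off is that a GET downloads the whole document, so the fallback makes checking questionable links slower; setting `$ua->max_size` to a small byte count may help limit that.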

Re: help with link checking
by arhuman (Vicar) on Mar 06, 2001 at 23:26 UTC
    UPDATE : Once again I typed/checked it too slowly...
    And please stop voting on this node! Upvote the one above...


    Replacing
    if(($ua->request(HTTP::Request->new('HEAD', $url)))->is_success())
    by
    if(($ua->request(HTTP::Request->new('HEAD', $url)))->code() == 200)
    should work...
      I thought that code would work, but it doesn't. Check out this simplified version of the script:
      #!/usr/bin/perl
      use LWP::UserAgent;

      ($ua = LWP::UserAgent->new)->timeout(20);
      $url = "http://202.103.25.186/music/Phil%20Collins%20-%20That's%20What%20You%20Said.mp3";

      if(($ua->request(HTTP::Request->new('HEAD', $url)))->code() == 200) {
          print "GOOD\n";
      } else {
          print "BAD\n";
      }
      When this script is run, it prints GOOD. However, open http://202.103.25.186/music/Phil%20Collins%20-%20That's%20What%20You%20Said.mp3 in your browser and you'll see that it doesn't serve up an mp3 file but an HTML page. What can I do?
        Ahh, it got a redirect! Use simple_request instead of request and you'll get back the 301/302 code which you can use as a "bad link" confirmation.

        That's what I've done in the various link checkers I've written recently.

        -- Randal L. Schwartz, Perl hacker
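The simple_request suggestion might look roughly like the sketch below. It assumes that only a direct 200 should count as a good link, so a 301/302 is reported as bad; the helper names `link_is_good` and `check_link` are hypothetical, and URLs are taken from the command line rather than from the database.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# Hypothetical helper: a link counts as good only on a direct 200.
sub link_is_good {
    my ($response) = @_;
    return $response->code == 200;
}

# Hypothetical helper: HEAD the URL once, without following redirects.
sub check_link {
    my ($ua, $url) = @_;

    # simple_request() hands back the first response as-is, so a
    # 301/302 stays visible instead of being followed to the error page.
    my $response = $ua->simple_request(HTTP::Request->new(HEAD => $url));

    return (link_is_good($response), $response->status_line);
}

# Check every URL given on the command line.
if (@ARGV) {
    my $ua = LWP::UserAgent->new;
    $ua->timeout(20);

    for my $url (@ARGV) {
        my ($ok, $status) = check_link($ua, $url);
        printf "%s  %s  (%s)\n", ($ok ? 'GOOD' : 'BAD'), $url, $status;
    }
}
```

Treating every redirect as bad is a design choice: it also flags links that have legitimately moved, so the printed status line is worth a look before deleting anything from the database.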