stonecolddevin has asked for the wisdom of the Perl Monks concerning the following question:

Howdy gang,

I'm working on a university website that has a list of links to scholarships for students. I've been handed the task of making sure all these links are "working," which as far as I'm concerned means they go somewhere.

I've been working on a script using WWW::Mechanize to:

  1. loop through a list of the pages that contain these links
  2. loop through the current page's links and use ->follow_link( text_regex => qr/$_/i ) to find and follow each link
  3. check the status of the last request using ->success
  4. push unsuccessful requests (where ->success returns false) into an array called @bad_links
  5. and finally, write @bad_links to a text file.

When I run the script I've written, it seems to get stuck in an infinite loop with the first link.

No errors, and I don't really know what could be holding it up.

Here is the source listing, which isn't pretty but will hopefully eventually do the job:

#!/usr/bin/perl -w
use strict;

## initialize the objects that we need
use WWW::Mechanize;               ## used to fetch the page we want

my $mech = WWW::Mechanize->new(); ## our ::Mechanize object

## initialize an array of "bad" links
## we'll write this to a file when we're done
my @bad_links;

## site root
my $site_root = "http://www.mscd.edu/~women/scholarships/";

## array of URLs to check
## probably wanna stick these in a file in the future
my @urls_to_check = ('schola-f.shtml', 'scholg-l.shtml', 'scholm-r.shtml', 'schols-z.shtml');

my $bad_links_file = "badlinks.txt";

## Start!

## loop through our urls we need to check
for ( @urls_to_check ) {
    print "Getting $site_root$_...\n";
    $mech->get( $site_root . $_ );

    if ( $mech->success ) {
        print "Successfully retrieved $site_root$_\n";
    }
    else {
        print "Couldn't retrieve $site_root$_!\n";
    }

    ## loop through our list of links
    while ( $mech->links ) {
        print "Following $_\n";
        $mech->follow_link( text_regex => qr/$_/i );

        ## we need to either move on to the next link if this one is
        ## successful or push it into the @bad_links array if it isn't
        if ( $mech->success ) {
            print "Successfully followed $_\n";
        }
        else {
            push @bad_links, $_;
            print "Unsuccessful in retrieving $_, moving on\n";
        }
    }
}

print "Finished checking links. Writing results.\n";

open (BADLINKS, '>>', $bad_links_file);
for ( @bad_links ) {
    print BADLINKS $_ . "\n";
}
close (BADLINKS);

## Finished!

Thanks in advance!

meh.

Replies are listed 'Best First'.
Re: Logging URLs that don't return 1 with $mech->success
by moritz (Cardinal) on Sep 10, 2008 at 17:07 UTC
    When I run the script I've written, it seems to get stuck in an infinite loop with the first link.

    Maybe there's a self-referential link on that page? The usual approach is to use a hash that stores all visited URLs, so you don't visit them again.
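
    A minimal sketch of that idea, as it might fit the posted loop; it assumes the $mech object from the original script, and the %seen name (and the print) are illustrative rather than from the posted code:

    my %seen;                                  # every URL we have already handled
    for my $link ( $mech->links ) {
        my $url = $link->url_abs->as_string;   # absolute URL, so duplicates compare equal
        next if $seen{$url}++;                 # seen before: skip it
        print "Would follow $url\n";           # follow / check the link here
    }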

      Aha.

      I had a feeling I missed something. I'll give that a go; hopefully my method for checking whether a link is 'valid' will work.

      meh.
Re: Logging URLs that don't return 1 with $mech->success
by Joost (Canon) on Sep 10, 2008 at 23:35 UTC
    # loop through the current page's links and use ->follow_link( text_regex => qr/$_/i ) to find and follow the current link
    That would only follow the first link on that page matching some regex.

    That may be what you want, but it reads as though you'd want to do something like:

    for my $link ($mech->find_all_links) {   # on this page
        $mech->get($link->url);
        unless ($mech->success) {
            warn "can't get ".$link->url.", status: ".$mech->status;
        }
        $mech->back;
    }

      Here's what I've come up with. It's even uglier, I think, but it seems to have worked. It just needs to skip over "mailto:" links, which is easy.
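
      (For reference, and not the poster's actual code: one hedged way to skip mailto: links is to test each link's URL before fetching it, roughly like this.)

      # sketch only: skip links that can't be fetched over HTTP before checking them
      for my $link ( $mech->find_all_links ) {
          next if $link->url =~ /^(?:mailto|javascript):/i;   # skip mailto: and friends
          # ...fetch and check $link->url_abs here, as in the posted loop...
      }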

      meh.

        The top-level URLs are processed differently from the links found at those URLs, so it makes no sense to use the same "checked" hash for both kinds of URL.

        The following should be removed:

        if ( $_ eq $checked_urls{$_} ) {
            print "Link checked, skipping\n";
            next;
        }
        else
Re: Logging URLs that don't return 1 with $mech->success
by ikegami (Patriarch) on Sep 11, 2008 at 01:10 UTC
    while ( $mech->links )

    should be

    for ( $mech->links )
    • while doesn't loop over a list and $mech->links isn't an iterator.
    • while doesn't set $_.

    Also, I'm not convinced of the reliability of following a link on Page A after having followed a link to Page B. It appears to work for now, but there could easily be side effects, and the behaviour could easily change in the future.

      Also, I'm not convinced of the reliability of following a link on Page A after having followed a link to Page B

      I don't really understand this. Can you explain where the unreliability is?

      Thanks for pointing out the while issue!

      meh.

        Potential unreliability. It doesn't look right to me to follow a link that exists on a page the Mechanize object no longer has loaded. It could very well be that the Link object is independent of the page that spawned it, but to rely on that sounds dangerous to me. It might not be, but it's worth looking into and adding comments explaining this.
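
        One way to sidestep the question entirely (a sketch, reusing $mech and @bad_links from the original script, and assuming the object was created with autocheck off so get() doesn't die on failure): copy the absolute URLs out of the Link objects before navigating anywhere, then fetch each URL directly.

        my @urls = map { $_->url_abs } $mech->links;   # plain URI objects, independent of the page
        for my $url (@urls) {
            $mech->get($url);                          # fetch by URL, no Link object needed
            push @bad_links, $url->as_string unless $mech->success;
        }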

Re: Logging URLs that don't return 1 with $mech->success
by Limbic~Region (Chancellor) on Sep 11, 2008 at 15:28 UTC
    dhoss,
    Let's assume your employer is so happy with the work you've done, and how quickly you did it, that they now want you to check that all the pages linked to from the scholarship pages have valid links, and that the pages those link to have...

    With only small variations, your code can be turned into a depth first search (DFS).

    my (%seen, @bad_link);
    for my $url (@base_pages) {
        my @work = get_links($url);
        while (@work) {
            my $link = pop @work;
            next if $seen{$link}++;
            if (is_good($link)) {
                push @work, get_links($link);
            }
            else {
                push @bad_link, $link;
            }
        }
    }
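
    get_links() and is_good() are left undefined above; a possible WWW::Mechanize-based fill-in (the sub names match the pseudocode, everything else, including the autocheck => 0 object, is an assumption):

    use WWW::Mechanize;
    my $mech = WWW::Mechanize->new( autocheck => 0 );   # don't die on failed fetches

    sub is_good {
        my $url = shift;
        $mech->get($url);
        return $mech->success;                          # true if the fetch worked
    }

    sub get_links {
        my $url = shift;
        $mech->get($url);                               # refetches the page; fine for a sketch
        return () unless $mech->success;
        return map { $_->url_abs->as_string } $mech->links;
    }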

    I know this is your employer's site, so obeying the rules of robots.txt probably doesn't apply to you, but you should keep it in mind for any crawler you write, along with a delay between page fetches to be nice to the server.
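
    The simplest throttle is a sleep between fetches inside the work loop above (a sketch; the two-second figure is arbitrary). For robots.txt, LWP::RobotUA is the stock libwww-perl way to honour it, should that ever become relevant.

    while (@work) {
        my $link = pop @work;
        next if $seen{$link}++;
        sleep 2;                                        # be nice to the server
        if    (is_good($link)) { push @work, get_links($link) }
        else                   { push @bad_link, $link }
    }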

    Cheers - L~R

      Limbic~Region,

      Thanks very much!

      My next step was to add "throttling" or what have you so that I'm not querying a given site inconsiderately. I didn't even really think of the DFS, that's a pretty neat idea! I'll play with this, and propose the idea to my employer.

      I think I could even extend this into something for the backend admin panel I'm surely going to be writing (for EVERYONE'S sanity), with an internal/external link flag that could make this more robust and, with any luck, faster.

      Thanks again, that's a neat idea! :-)

      meh.
        dhoss,
        Actually, I just realized you could have a monster on your hands without one more sanity check:
        # push @work, get_links($link);
        push @work, get_links($link) if ! off_site($link);
        I am sure that somewhere on the university website there is an off-site link, and you don't want to end up crawling the entire internet - it could take a while (and get you fired).
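
        off_site() isn't defined above; one hedged way to write it (using the URI module and the $site_root variable from the original script) is to compare each link's host with the site's host:

        use URI;
        my $site_host = lc( URI->new($site_root)->host );   # e.g. "www.mscd.edu"

        sub off_site {
            my $uri  = URI->new(shift);
            my $host = $uri->can('host') ? lc($uri->host || '') : '';   # mailto: etc. have no host
            return $host ne $site_host;                      # true means: don't crawl it
        }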

        Cheers - L~R