stonecolddevin has asked for the wisdom of the Perl Monks concerning the following question:

Howdy gang,

I'm working on a university website that has a list of links to scholarships for students. I've been handed the task of making sure all these links are "working," which as far as I'm concerned means they go somewhere.

I've been working on a script using WWW::Mechanize to:

  1. loop through a list of the pages that contain these links
  2. loop through the current page's links and use ->follow_link( text_regex => qr/$_/i ) to find and follow each link
  3. check the status of the last request using ->success
  4. push unsuccessful requests (where ->success returns false) into an array called @bad_links
  5. and finally, write @bad_links to a text file.

When I run the script I've written, it seems to get stuck in an infinite loop with the first link.

No errors, and I don't really know what could be holding it up.

Here is the source listing, which isn't pretty but will hopefully eventually do the job:

#!/usr/bin/perl -w
use strict;

## initialize the objects that we need
use WWW::Mechanize;               ## used to fetch the page we want

my $mech = WWW::Mechanize->new(); ## our ::Mechanize object

## initialize an array of "bad" links
## we'll write this to a file when we're done
my @bad_links;

## site root
my $site_root = "http://www.mscd.edu/~women/scholarships/";

## array of URLs to check
## probably wanna stick these in a file in the future
my @urls_to_check = ('schola-f.shtml', 'scholg-l.shtml', 'scholm-r.shtml', 'schols-z.shtml');

my $bad_links_file = "badlinks.txt";

## Start!

## loop through our urls we need to check
for ( @urls_to_check ) {
    print "Getting $site_root$_...\n";
    $mech->get( $site_root . $_ );

    if ( $mech->success ) {
        print "Successfully retrieved $site_root$_\n";
    }
    else {
        print "Couldn't retrieve $site_root$_!\n";
    }

    ## loop through our list of links
    while ( $mech->links ) {
        print "Following $_\n";
        $mech->follow_link( text_regex => qr/$_/i );

        ## we need to either move on to the next link if this one is
        ## successful or push it into the @bad_links array if it isn't
        if ( $mech->success ) {
            print "Successfully followed $_\n";
        }
        else {
            push @bad_links, $_;
            print "Unsuccessful in retrieving $_, moving on\n";
        }
    }
}

print "Finished checking links. Writing results.\n";

open (BADLINKS, '>>', $bad_links_file);
for ( @bad_links ) {
    print BADLINKS $_ . "\n";
}
close (BADLINKS);

## Finished!

Thanks in advance!

meh.

Replies are listed 'Best First'.
Re: Logging URLs that don't return 1 with $mech->success
by moritz (Cardinal) on Sep 10, 2008 at 17:07 UTC
    When I run the script I've written, it seems to get stuck in an infinite loop with the first link.

    Maybe there's a self-referential link on that page? The usual approach is to use a hash that stores all visited URLs, so you don't visit them again.
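
    A minimal sketch of that idea, as it might fit the posted loop; it assumes the $mech object from the original script, and the %seen name (and the print) are illustrative rather than from the posted code:

    my %seen;                                  # every URL we have already handled
    for my $link ( $mech->links ) {
        my $url = $link->url_abs->as_string;   # absolute URL, so duplicates compare equal
        next if $seen{$url}++;                 # seen before: skip it
        print "Would follow $url\n";           # follow / check the link here
    }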

      Aha.

      I had a feeling I missed something. I'll give that a go; hopefully my method for checking whether a link is 'valid' will work.

      meh.
Re: Logging URLs that don't return 1 with $mech->success
by Joost (Canon) on Sep 10, 2008 at 23:35 UTC
    # loop through the current page's links and use ->follow_link( text_regex => qr/$_/i ) to find and follow the current link
    That would only follow the first link on that page matching some regex.

    That may be what you want, but it reads as though you'd want to do something like:

    for my $link ($mech->find_all_links) {   # on this page
        $mech->get($link->url);
        unless ($mech->success) {
            warn "can't get ".$link->url.", status: ".$mech->status;
        }
        $mech->back;
    }

      Here's what I've come up with. It's even uglier, I think, but it seems to have worked. It just needs to skip over "mailto:" links, which is easy.
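
      (For reference, and not the poster's actual code: one hedged way to skip mailto: links is to test each link's URL before fetching it, roughly like this.)

      # sketch only: skip links that can't be fetched over HTTP before checking them
      for my $link ( $mech->find_all_links ) {
          next if $link->url =~ /^(?:mailto|javascript):/i;   # skip mailto: and friends
          # ...fetch and check $link->url_abs here, as in the posted loop...
      }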

      meh.

        The top-level URLs are processed differently from the links found at those URLs, so it makes no sense to use the same "checked" hash for both kinds of URL.

        The following should be removed:

        if ( $_ eq $checked_urls{$_} ) {
            print "Link checked, skipping\n";
            next;
        }
        else
Re: Logging URLs that don't return 1 with $mech->success
by ikegami (Patriarch) on Sep 11, 2008 at 01:10 UTC
    while ( $mech->links )

    should be

    for ( $mech->links )
    • while doesn't loop over a list and $mech->links isn't an iterator.
    • while doesn't set $_.

    Also, I'm not convinced of the reliability of following a link on Page A after having followed a link to Page B. It appears to work for now, but there could easily be side effects, and the behaviour could easily change in the future.

      Also, I'm not convinced of the reliability of following a link on Page A after having followed a link to Page B

      I don't really understand this. Can you explain where the unreliability is?

      Thanks for pointing out the while issue!

      meh.

        Potential unreliability. It doesn't look right to me to follow a link that exists on a page the Mechanize object no longer has loaded. It could very well be that the Link object is independent of the page that spawned it, but to rely on that sounds dangerous to me. It might not be, but it's worth looking into and adding comments explaining this.
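
        One way to sidestep the question entirely (a sketch, reusing $mech and @bad_links from the original script, and assuming the object was created with autocheck off so get() doesn't die on failure): copy the absolute URLs out of the Link objects before navigating anywhere, then fetch each URL directly.

        my @urls = map { $_->url_abs } $mech->links;   # plain URI objects, independent of the page
        for my $url (@urls) {
            $mech->get($url);                          # fetch by URL, no Link object needed
            push @bad_links, $url->as_string unless $mech->success;
        }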

Re: Logging URLs that don't return 1 with $mech->success
by Limbic~Region (Chancellor) on Sep 11, 2008 at 15:28 UTC
    dhoss,
    Let's assume your employer is so happy with the work you've done, and how quickly you did it, that they now want you to check that all the pages linked to from the scholarship pages have valid links, and that the pages those link to have...

    With only small variations, your code can be turned into a depth first search (DFS).

    my (%seen, @bad_link);
    for my $url (@base_pages) {
        my @work = get_links($url);
        while (@work) {
            my $link = pop @work;
            next if $seen{$link}++;
            if (is_good($link)) {
                push @work, get_links($link);
            }
            else {
                push @bad_link, $link;
            }
        }
    }
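
    get_links() and is_good() are left undefined above; a possible WWW::Mechanize-based fill-in (the sub names match the pseudocode, everything else, including the autocheck => 0 object, is an assumption):

    use WWW::Mechanize;
    my $mech = WWW::Mechanize->new( autocheck => 0 );   # don't die on failed fetches

    sub is_good {
        my $url = shift;
        $mech->get($url);
        return $mech->success;                          # true if the fetch worked
    }

    sub get_links {
        my $url = shift;
        $mech->get($url);                               # refetches the page; fine for a sketch
        return () unless $mech->success;
        return map { $_->url_abs->as_string } $mech->links;
    }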

    I know this is your employer's site, so obeying the rules of robots.txt probably doesn't apply to you, but you should keep it in mind for any crawler you write, along with a delay between page fetches to be nice to the server.
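
    The simplest throttle is a sleep between fetches inside the work loop above (a sketch; the two-second figure is arbitrary). For robots.txt, LWP::RobotUA is the stock libwww-perl way to honour it, should that ever become relevant.

    while (@work) {
        my $link = pop @work;
        next if $seen{$link}++;
        sleep 2;                                        # be nice to the server
        if    (is_good($link)) { push @work, get_links($link) }
        else                   { push @bad_link, $link }
    }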

    Cheers - L~R

      Limbic~Region,

      Thanks very much!

      My next step was to add "throttling" or what have you so that I'm not querying a given site inconsiderately. I didn't even really think of the DFS, that's a pretty neat idea! I'll play with this, and propose the idea to my employer.

      I think I could even extend this into something for the backend admin panel I'm surely going to be writing (for EVERYONE'S sanity), with an internal/external link flag that could make this more robust and, with any luck, faster.

      Thanks again, that's a neat idea! :-)

      meh.
        dhoss,
        Actually, I just realized you could have a monster on your hands without one more sanity check:
        # push @work, get_links($link);
        push @work, get_links($link) if ! off_site($link);
        I am sure that somewhere on the university website there is an off-site link, and you don't want to end up crawling the entire internet - it could take a while (and get you fired).
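
        off_site() isn't defined above; one hedged way to write it (using the URI module and the $site_root variable from the original script) is to compare each link's host with the site's host:

        use URI;
        my $site_host = lc( URI->new($site_root)->host );   # e.g. "www.mscd.edu"

        sub off_site {
            my $uri  = URI->new(shift);
            my $host = $uri->can('host') ? lc($uri->host || '') : '';   # mailto: etc. have no host
            return $host ne $site_host;                      # true means: don't crawl it
        }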

        Cheers - L~R