in reply to Re: Logging URLs that don't return 1 with $mech->success
in thread Logging URLs that don't return 1 with $mech->success

Here's what I've come up with. It looks even uglier, I think, but it seems to have worked. It still needs to skip over "mailto:" links, which should be easy.

#!/usr/bin/perl -w
use strict;

## initialize the objects that we need
use WWW::Mechanize;                  ## used to fetch the page we want
my $mech = WWW::Mechanize->new();    ## our ::Mechanize object

## initialize an array of "bad" links
## we'll write this to a file when we're done
my @bad_links;

## site root
my $site_root = "http://www.mscd.edu/~women/scholarships/";

## array of URLs to check
## probably wanna stick these in a file in the future
my @urls_to_check = ('schola-f.shtml', 'scholg-l.shtml', 'scholm-r.shtml', 'schols-z.shtml');
my $bad_links_file = "badlinks.txt";
my %checked_urls;

## Start!
## loop through our urls we need to check
## Thanks to Joost from perlmonks
for ( @urls_to_check ) {
    print "Trying to get " . $site_root . $_ . "\n";
    if ( $_ eq $checked_urls{$_} ) {
        print "Link checked, skipping\n";
        next;
    } else {
        $mech->get( $site_root . $_ ); # or next if $site_root.$_ eq $checked_urls{$site_root.$_};
        print "Got ". $site_root . $_ ."\n" unless $mech->success;
        $checked_urls{$site_root} = $site_root . $_;
        for my $link ($mech->find_all_links) { # on this page
            if ( $link->url eq $checked_urls{$link->url} ) {
                print "Link checked, skipping\n";
                next;
            } else {
                print "Getting ". $link->url ."\n";
                $mech->get($link->url);
                $checked_urls{$link->url} = $link->url;
                unless ($mech->success) {
                    print "can't get ".$link->url.", status: ".$mech->status;
                    push @bad_links, $link->url;
                }
                $mech->back;
            }
        }
    }
}

print "Finished checking links. Writing results.\n";
open (BADLINKS, '>>', $bad_links_file);
for ( @bad_links ) {
    print BADLINKS $_ . "\n";
}
close (BADLINKS);
## Finished!
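For the "mailto:" skipping mentioned above, one possible approach is to filter the link list before fetching anything. The following is only a sketch: `links_to_check` and `%checked_links` are hypothetical names that are not in the script, and the example.com URLs are placeholders. It also records seen links with `++` rather than comparing against the stored value with `eq`, which avoids "uninitialized value" warnings under `-w` the first time a link is looked up.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the posted script): return the URLs
# worth fetching, in order, dropping mailto: links and anything already
# recorded in the %$seen hash.
sub links_to_check {
    my ($seen, @urls) = @_;
    my @todo;
    for my $url (@urls) {
        next if $url =~ /^mailto:/i;   # cannot be fetched over HTTP
        next if $seen->{$url}++;       # true on the second and later sightings
        push @todo, $url;
    }
    return @todo;
}

# Placeholder data for illustration only.
my %checked_links;
my @todo = links_to_check(
    \%checked_links,
    'http://example.com/a.html',
    'mailto:someone@example.com',
    'http://example.com/a.html',
    'http://example.com/b.html',
);
print "$_\n" for @todo;
```

In the real script the list passed in would come from something like `map { $_->url } $mech->find_all_links`.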
meh.

Re^3: Logging URLs that don't return 1 with $mech->success
by ikegami (Patriarch) on Sep 11, 2008 at 01:14 UTC

    The top-level urls are processed differently than the links found at the urls, so it makes no sense to use the same "checked" hash for both types of urls.

    The following should be removed:

    if ( $_ eq $checked_urls{$_} ) { print "Link checked, skipping\n"; next; } else
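A rough sketch of how the loop might look with that check removed, so that top-level pages are always fetched and only the links found on them are de-duplicated. This is an illustration under assumptions, not the poster's final code: `find_bad_links` and `%checked_links` are hypothetical names, `url_abs` is used in place of the original `url` so that relative links resolve, and `$mech` is assumed to be created with `WWW::Mechanize->new( autocheck => 0 )` so that a failed `get` does not die before `success` can be tested.

```perl
use strict;
use warnings;

# Sketch only. $mech is expected to behave like a WWW::Mechanize object
# created with autocheck => 0 (assumption); it must provide get,
# success and find_all_links.
sub find_bad_links {
    my ($mech, $site_root, @pages) = @_;
    my (%checked_links, @bad_links);

    for my $page (@pages) {
        # Top-level pages are fetched unconditionally: no hash lookup here.
        $mech->get( $site_root . $page );
        unless ( $mech->success ) {
            push @bad_links, $site_root . $page;
            next;
        }

        # The link list is built once, before any further get() calls,
        # so fetching a link does not disturb the iteration.
        for my $link ( $mech->find_all_links ) {
            my $url = "" . $link->url_abs;     # stringify the URI object
            next if $url =~ /^mailto:/i;       # not fetchable over HTTP
            next if $checked_links{$url}++;    # already tested this link
            $mech->get($url);
            push @bad_links, $url unless $mech->success;
        }
    }
    return @bad_links;
}
```

It would be called as `my @bad_links = find_bad_links($mech, $site_root, @urls_to_check);`, with the results written to the file as before.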

      Thanks, ikegami. I had been looking at that and scratching my head over it. Having the same if statement in two different places didn't really look right to me.

      meh.