in reply to Logging URLs that don't return 1 with $mech->success

dhoss,
Let's assume your employer is so happy with the work you have done, and how quickly you did it, that they now want you to check that all the pages linked to from the scholarship page have valid links, and that the pages they link to have...

With only small variations, your code can be turned into a depth-first search (DFS).

my (%seen, @bad_link);                    # visited links and the broken ones we find
for my $url (@base_pages) {
    my @work = get_links($url);           # links found on the starting page
    while (@work) {
        my $link = pop @work;
        next if $seen{$link}++;           # skip links we have already checked
        if (is_good($link)) {
            push @work, get_links($link); # good page: queue its links too
        }
        else {
            push @bad_link, $link;        # bad page: remember it for the report
        }
    }
}
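
get_links() and is_good() are left as an exercise, but here is a rough sketch of how they might look with the WWW::Mechanize object you are already using (the helper names and the autocheck setting are just placeholders, not the only way to do it):

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 0 );

# Hypothetical is_good(): fetch the page and report whether
# $mech->success is true, which is what your script already logs.
sub is_good {
    my ($url) = @_;
    $mech->get($url);
    return $mech->success;
}

# Hypothetical get_links(): fetch the page and return the absolute
# URL of every link on it.
sub get_links {
    my ($url) = @_;
    $mech->get($url);
    return map { $_->url_abs->as_string } $mech->links;
}

In practice you would probably fold the two fetches together so a good page isn't requested twice.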

I know this is your employer's site, so obeying the rules of robots.txt probably doesn't apply to you, but you should keep it in mind for any crawler you write, along with a delay between page fetches to be nice to the server.
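
If you want both with almost no effort, LWP::RobotUA (a subclass of LWP::UserAgent, the same base class WWW::Mechanize builds on) honours robots.txt and enforces a minimum delay between requests for you. A rough sketch, with the agent name, contact address, delay, and URL as placeholders:

use strict;
use warnings;
use LWP::RobotUA;

# LWP::RobotUA fetches and obeys robots.txt itself and waits between
# requests to the same host.
my $ua = LWP::RobotUA->new('scholarship-link-checker/0.1', 'you@example.edu');
$ua->delay(10/60);    # at least 10 seconds between fetches (argument is in minutes)

my $res = $ua->get('http://www.example.edu/scholarships.html');
print $res->is_success ? "ok\n" : 'bad: ' . $res->status_line . "\n";

With plain WWW::Mechanize, a simple sleep between $mech->get calls gets you most of the way there.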

Cheers - L~R

Re^2: Logging URLs that don't return 1 with $mech->success
by stonecolddevin (Parson) on Sep 12, 2008 at 01:49 UTC

    Limbic~Region,

    Thanks very much!

    My next step was to add "throttling" or what have you so that I'm not querying a given site inconsiderately. I didn't even really think of the DFS, that's a pretty neat idea! I'll play with this, and propose the idea to my employer.

    I think I could even extend this into something on the backend admin panel I'm surely going to be writing (for EVERYONE'S sanity), using an internal/external link boolean that would potentially make this more robust and, with any luck, fast(er).

    Thanks again, that's a neat idea! :-)

    meh.
      dhoss,
      Actually, I just realized you could have a monster on your hands without one more sanity check:
      # push @work, get_links($link);
      push @work, get_links($link) if ! off_site($link);
      I am sure somewhere on the university website there is an off-site link, and you don't want to end up crawling the entire internet - it could take a while (and get you fired).
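
      An off_site() check could be as simple as comparing hosts with the URI module. A sketch, assuming get_links() hands back absolute URLs and that @base_pages is the list from the snippet above:

      use URI;

      # Hypothetical helper: a link is off-site when its host differs
      # from the host of the page we started crawling from.
      my $base_host = URI->new($base_pages[0])->host;

      sub off_site {
          my ($link) = @_;
          my $host = eval { URI->new($link)->host };   # relative/odd URLs have no host
          return 0 unless defined $host;               # treat those as on-site
          return lc($host) ne lc($base_host);
      }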

      Cheers - L~R