in reply to Logging URLs that don't return 1 with $mech->success
With only small variations, your code can be turned into a depth-first search (DFS).
    my (%seen, @bad_link);
    for my $url (@base_pages) {
        my @work = get_links($url);
        while (@work) {
            my $link = pop @work;
            next if $seen{$link}++;
            if (is_good($link)) {
                push @work, get_links($link);
            }
            else {
                push @bad_link, $link;
            }
        }
    }
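If you want to try the loop without touching a network, here is a minimal sketch. The %site hash, the stubbed get_links and is_good, and the find_bad_links wrapper are all my own stand-ins for illustration; in real code the two helpers would wrap your $mech calls.

```perl
use strict;
use warnings;

# Hypothetical in-memory "site": page => links found on that page.
# A link that is not a key of %site stands in for a URL that fails to load.
my %site = (
    '/'  => [ '/a', '/b' ],
    '/a' => [ '/b', '/dead1' ],
    '/b' => [ '/',  '/c' ],
    '/c' => [ '/dead2', '/a' ],
);

# Stubs (assumptions): real versions would fetch the page with $mech.
sub get_links { my ($url) = @_; return @{ $site{$url} || [] } }
sub is_good   { my ($url) = @_; return exists $site{$url} }

sub find_bad_links {
    my @base_pages = @_;
    my (%seen, @bad_link);
    for my $url (@base_pages) {
        my @work = get_links($url);
        while (@work) {
            my $link = pop @work;        # pop + push = stack = depth first
            next if $seen{$link}++;      # visit each link only once
            if (is_good($link)) {
                push @work, get_links($link);
            }
            else {
                push @bad_link, $link;
            }
        }
    }
    return @bad_link;
}

my @bad = sort( find_bad_links('/') );
print "$_\n" for @bad;
```

Swapping the pop for a shift would turn the same loop into a breadth-first search, since @work would then behave as a queue instead of a stack.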
I know this is your employer's site, so obeying the rules of robots.txt probably doesn't apply to you, but you should keep it in mind for any crawler you write. The same goes for a delay between page fetches, to be nice to the server.
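For the delay, a rough sketch using only core modules might look like the following. The make_polite_fetcher name, the stub fetcher, and the 0.2 second gap are assumptions of mine for the demo; you would pass in a sub that calls $mech->get and pick a gap the server can live with. If you are on plain LWP, LWP::RobotUA is a user agent that consults robots.txt and enforces a per-server delay for you (its delay is given in minutes).

```perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

# Returns a closure that calls $fetch->($url), but never more often
# than once every $min_gap seconds (hypothetical helper, not a CPAN API).
sub make_polite_fetcher {
    my ($fetch, $min_gap) = @_;
    my $last = 0;                        # time of the previous fetch
    return sub {
        my ($url) = @_;
        my $wait = $last + $min_gap - time();
        sleep($wait) if $wait > 0;       # pause only if we are too early
        $last = time();
        return $fetch->($url);
    };
}

# Stub fetcher for demonstration; a real one would call $mech->get($url).
my $polite = make_polite_fetcher( sub { return "fetched $_[0]" }, 0.2 );

my $start   = time();
my @results = map { $polite->($_) } qw(/a /b /c);
my $elapsed = time() - $start;

print "$_\n" for @results;
printf "took %.2f seconds\n", $elapsed;
```

The first fetch goes out immediately and each later one waits out the remainder of the gap, so a slow page that already took longer than the gap costs no extra sleep.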
Cheers - L~R
Re^2: Logging URLs that don't return 1 with $mech->success
by stonecolddevin (Parson) on Sep 12, 2008 at 01:49 UTC
by Limbic~Region (Chancellor) on Sep 12, 2008 at 13:13 UTC