aarestad has asked for the wisdom of the Perl Monks concerning the following question:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtractor;
use LWP::Simple qw(get);

# Base for searching
my $base = "http://business.intra.company.com";

# Create the LinkExtractor object for use in the subroutine
my $LX = new HTML::LinkExtractor(undef, "$base/it/");

# List of all the links found
my @allLinks;

# Start here on the recursive traversal
recursiveFollow("/it/index.html");

foreach (@allLinks) {
    # Print a list of resources to be used on apollo
    print "/wwwprod/docs/business/docs" . $_->path, "\n";
}

sub recursiveFollow {
    my $file = shift;
    my $html = get("$base$file");
    my @thisDocLinks;
    if (!defined $html) {
        warn "file not found: $base$file\n";
        return;
    }
    # DEBUG
    print "got $base$file\n";
    # /DEBUG
    $LX->parse(\$html);
    for my $link (@{ $LX->links }) {
        next if !defined $$link{href};
        # Stash the link if it's a relative link or it begins with $base
        # but NOT if it's a "file:///" URI
        if (($$link{href} !~ /^http:/ || $$link{href} =~ /business\.intra/)
            && ($$link{href} !~ /^file:/)) {
            push @allLinks, $$link{href};
            if ($$link{href} =~ /\.html?$/) {
                push @thisDocLinks, $$link{href};
            }
        }
    }
    # Follow each link to an htm/html file found in this file recursively
    foreach (@thisDocLinks) {
        recursiveFollow($_->path);
    }
}
There are two problems, one moderate and one major. The moderate problem is that HTML::LinkExtractor::parse() doesn't seem to pick out IMG locations, and discovering IMG links is an important part of my analysis. Is there a hack I can use to search the $html I get for the IMG SRC locations?
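The sort of hack I have in mind is a straight regex pass over the raw $html (the sample markup and the @imgSrcs name below are my own invention, and this assumes quoted src attributes; a real HTML parser would be more robust):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Rough sketch: scan raw HTML for IMG SRC values with a regex.
# $html here is a stand-in for the page content fetched with get().
my $html = q{<p><img src="/it/img/logo.gif" alt="logo"><IMG SRC='pic.png'></p>};

my @imgSrcs;
while ($html =~ /<img\b[^>]*\bsrc\s*=\s*["']([^"']+)["']/gi) {
    push @imgSrcs, $1;   # capture each quoted src value
}
print "$_\n" for @imgSrcs;
```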
The major problem is that it gets stuck in an infinite loop, seemingly because it's trying to access the same file over and over. Obviously I just want to visit any one page exactly once, but I'm getting this trace when running:
$ ./findorphans.pl
got http://business.intra.company.com/it/index.html
got http://business.intra.company.com/it/meetings/meeting_schedule.htm
file not found: http://business.intra.company.com/meetings/meeting_schedule.htm
file not found: http://business.intra.company.com/associates/associate_info.htm
file not found: http://business.intra.company.com/associates/index.html
got http://business.intra.company.com/it/don/spotlight_winners.html
file not found: http://business.intra.company.com/ask/index.html
file not found: http://business.intra.company.com/meetings/meeting_schedule.htm
file not found: http://business.intra.company.com/associates/associate_info.htm
file not found: http://business.intra.company.com/associates/index.html
got http://business.intra.company.com/it/associates/associate_info.htm
file not found: http://business.intra.company.com/../index.html
got http://business.intra.company.com/index.html
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
...
You get the picture. Can any sharp-eyed Monks see what's obviously wrong with this code? (I'm thinking of using a hash to prevent duplicate visits, but how?...)
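What I'm imagining is something like the following, though I'm not sure where best to hook it into my script. The %seen hash, the @visited list, and the %fakeSite pages here are all my own names standing in for real fetched pages; the only real idea is checking-and-marking at the top of the recursive sub so each path is visited at most once:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch of the dedup idea: %seen is keyed on the path and checked
# before anything else, so a page is ever processed only once.
my %seen;
my @visited;

# Tiny fake link graph in place of real get() calls; note /it/a.html
# links back to the index, which is exactly what used to loop forever.
my %fakeSite = (
    '/it/index.html' => ['/it/a.html', '/it/b.html'],
    '/it/a.html'     => ['/it/index.html'],
    '/it/b.html'     => [],
);

sub recursiveFollow {
    my $file = shift;
    return if $seen{$file}++;    # skip anything already visited
    push @visited, $file;
    recursiveFollow($_) for @{ $fakeSite{$file} || [] };
}

recursiveFollow('/it/index.html');
print "$_\n" for @visited;
```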
Replies are listed 'Best First'.

Re: Problems with cruft-finding script
by PodMaster (Abbot) on Dec 10, 2003 at 17:24 UTC
    by aarestad (Sexton) on Dec 10, 2003 at 18:12 UTC

Re: Problems with cruft-finding script
by Abigail-II (Bishop) on Dec 10, 2003 at 17:24 UTC