aarestad has asked for the wisdom of the Perl Monks concerning the following question:

I posted last week about trying to find a solution to a familiar web problem: finding content in a web directory that nothing else in that subweb links to. The best solution the Monks came up with was to collect the path of each document I come across and print it out as a full Unix path name. Here's my first attempt (with some links changed to protect the name of my innocent company :):

#!/usr/bin/perl

use strict;
use warnings;
use HTML::LinkExtractor;
use LWP::Simple qw(get);

# Base for searching
my $base = "http://business.intra.company.com";

# Create the LinkExtractor object for use in the subroutine
my $LX = new HTML::LinkExtractor(undef, "$base/it/");

# List of all the links found
my @allLinks;

# Start here on the recursive traversal
recursiveFollow("/it/index.html");

foreach (@allLinks) {
    # Print a list of resources to be used on apollo
    print "/wwwprod/docs/business/docs" . $_->path, "\n";
}

sub recursiveFollow {
    my $file = shift;
    my $html = get("$base$file");
    my @thisDocLinks;
    if (!defined $html) {
        warn "file not found: $base$file\n";
        return;
    }
    # DEBUG
    print "got $base$file\n";
    # /DEBUG
    $LX->parse(\$html);
    for my $link (@{ $LX->links }) {
        next if !defined $$link{href};
        # Stash the link if it's a relative link or it begins with $base
        # but NOT if it's a "file:///" URI
        if (($$link{href} !~ /^http:/ || $$link{href} =~ /business\.intra/)
            && ($$link{href} !~ /^file:/)) {
            push @allLinks, $$link{href};
            if ($$link{href} =~ /\.html?$/) {
                push @thisDocLinks, $$link{href};
            }
        }
    }
    # Follow each link to an htm/html file found in this file recursively
    foreach (@thisDocLinks) {
        recursiveFollow($_->path);
    }
}

There are two problems, one moderate and one major. The moderate problem is that HTML::LinkExtractor's parse() doesn't seem to pick out IMG locations, and discovering IMG links is an important part of my analysis. Is there a hack I can use to search the $html I fetch for the IMG SRC locations?
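For instance, would something along these lines be a reasonable hack? (An untested sketch using HTML::TokeParser, which ships with HTML::Parser, applied to the $html the script already fetches.)

use HTML::TokeParser;

# Untested sketch: pull the SRC attribute out of every IMG tag in the
# page that was just fetched into $html
my $p = HTML::TokeParser->new(\$html);
while (my $img = $p->get_tag('img')) {
    my $src = $img->[1]{src};    # element 1 of the token is the attribute hash
    push @allLinks, $src if defined $src;
}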

The major problem is that it gets stuck in an infinite loop, apparently because it keeps requesting the same files over and over. Obviously I want to visit any one page exactly once, but I'm getting this trace when running:

$ ./findorphans.pl
got http://business.intra.company.com/it/index.html
got http://business.intra.company.com/it/meetings/meeting_schedule.htm
file not found: http://business.intra.company.com/meetings/meeting_schedule.htm
file not found: http://business.intra.company.com/associates/associate_info.htm
file not found: http://business.intra.company.com/associates/index.html
got http://business.intra.company.com/it/don/spotlight_winners.html
file not found: http://business.intra.company.com/ask/index.html
file not found: http://business.intra.company.com/meetings/meeting_schedule.htm
file not found: http://business.intra.company.com/associates/associate_info.htm
file not found: http://business.intra.company.com/associates/index.html
got http://business.intra.company.com/it/associates/associate_info.htm
file not found: http://business.intra.company.com/../index.html
got http://business.intra.company.com/index.html
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
...

You get the picture. Can any sharp-eyed Monks see what's obviously wrong with this code? (I'm thinking of using a hash to prevent duplicate visits, but how?...)

Replies are listed 'Best First'.
Re: Problems with cruft-finding script
by PodMaster (Abbot) on Dec 10, 2003 at 17:24 UTC
    You need to read the docs a little closer, especially WHAT'S A LINK-type tag

    As a general strategy, when in doubt, Dumper :)

    use HTML::LinkExtractor;
    use Data::Dumper;

    my $input = q{
    <blockquote cite="http://crazyinsomniac.perlmonk.org/index2.html">
    Now that's some goood feedass (ass back it's all good) %')
    </blockquote>
    If
    <a href="http://perl.com/"> I am a LINK!!! </a>
    <IMG SRC="YODAYODAYODAYODAYODAYODAYODAYODAYOD.png" ALT="It's YODA!">
    };

    my $LX = new HTML::LinkExtractor();
    $LX->parse(\$input);
    print Dumper($LX->links);

    __END__
    $VAR1 = [
              {
                'cite' => 'http://crazyinsomniac.perlmonk.org/index2.html',
                '_TEXT' => '<blockquote cite="http://crazyinsomniac.perlmonk.org/index2.html">
    Now that\'s some goood feedass (ass back it\'s all good) %\')
    </blockquote>',
                'tag' => 'blockquote'
              },
              {
                '_TEXT' => '<a href="http://perl.com/"> I am a LINK!!! </a>',
                'href' => 'http://perl.com/',
                'tag' => 'a'
              },
              {
                'alt' => 'It\'s YODA!',
                'src' => 'YODAYODAYODAYODAYODAYODAYODAYODAYOD.png',
                'tag' => 'img'
              }
            ];


      Ah, from the author himself. :) Yes, you're right - I thought I read the docs carefully, but I missed that. :( So I suppose a good way to get the IMG SRCs is:

      foreach (@{ $LX->links }) {
          if ($$_{tag} eq 'img') {
              # new_abs is a class method on URI (needs "use URI;" up top)
              push @allLinks, URI->new_abs($$_{src}, $base);
          }
          # do other stuff
      }
      Being a bit of a Perl n00bie, I did not think to use Data::Dumper. Thanks for expanding my mind. :)
Re: Problems with cruft-finding script
by Abigail-II (Bishop) on Dec 10, 2003 at 17:24 UTC
    my %seen;

    sub recursiveFollow {
        my $file = shift;
        return if $seen{$file}++;
        ...
    }
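    In context, the guard goes at the top of the sub from the original post, roughly like this sketch (everything after the fetch stays as it was):

    my %seen;          # paths we have already fetched

    sub recursiveFollow {
        my $file = shift;
        return if $seen{$file}++;      # visit each page exactly once
        my $html = get("$base$file");
        # ... the rest of the original sub, unchanged ...
    }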

    Abigail