aarestad has asked for the wisdom of the Perl Monks concerning the following question:

I posted last week about trying to find a solution to a familiar web problem: finding content in a web directory that nothing else in that subweb links to. The best solution the Monks came up with was to collect the path of each document I come across and print it out as a full Unix path name. Here's my first attempt (with some links changed to protect the name of my innocent company :):

#!/usr/bin/perl

use strict;
use warnings;
use HTML::LinkExtractor;
use LWP::Simple qw(get);

# Base for searching
my $base = "http://business.intra.company.com";

# Create the LinkExtractor object for use in the subroutine
my $LX = new HTML::LinkExtractor(undef, "$base/it/");

# List of all the links found
my @allLinks;

# Start here on the recursive traversal
recursiveFollow("/it/index.html");

foreach (@allLinks) {
    # Print a list of resources to be used on apollo
    print "/wwwprod/docs/business/docs" . $_->path, "\n";
}

sub recursiveFollow {
    my $file = shift;
    my $html = get("$base$file");
    my @thisDocLinks;
    if (!defined $html) {
        warn "file not found: $base$file\n";
        return;
    }
    # DEBUG
    print "got $base$file\n";
    # /DEBUG
    $LX->parse(\$html);
    for my $link (@{ $LX->links }) {
        next if !defined $$link{href};
        # Stash the link if it's a relative link or it begins with $base
        # but NOT if it's a "file:///" URI
        if (($$link{href} !~ /^http:/ || $$link{href} =~ /business\.intra/)
            && ($$link{href} !~ /^file:/)) {
            push @allLinks, $$link{href};
            if ($$link{href} =~ /\.html?$/) {
                push @thisDocLinks, $$link{href};
            }
        }
    }
    # Follow each link to an htm/html file found in this file recursively
    foreach (@thisDocLinks) {
        recursiveFollow($_->path);
    }
}

There are two problems, one moderate and one major. The moderate problem is that HTML::LinkExtractor's parse() doesn't seem to pick out IMG locations, and discovering IMG links is an important part of my analysis. Is there a hack I can use to search the $html I fetch for the IMG SRC locations?
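For instance, would something along these lines be a reasonable hack? (An untested sketch using HTML::TokeParser, which ships with HTML::Parser, applied to the $html the script already fetches.)

use HTML::TokeParser;

# Untested sketch: pull the SRC attribute out of every IMG tag in the
# page that was just fetched into $html
my $p = HTML::TokeParser->new(\$html);
while (my $img = $p->get_tag('img')) {
    my $src = $img->[1]{src};    # element 1 of the token is the attribute hash
    push @allLinks, $src if defined $src;
}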

The major problem is that it gets stuck in an infinite loop, apparently because it keeps requesting the same files over and over. Obviously I want to visit any one page exactly once, but I'm getting this trace when running:

$ ./findorphans.pl
got http://business.intra.company.com/it/index.html
got http://business.intra.company.com/it/meetings/meeting_schedule.htm
file not found: http://business.intra.company.com/meetings/meeting_schedule.htm
file not found: http://business.intra.company.com/associates/associate_info.htm
file not found: http://business.intra.company.com/associates/index.html
got http://business.intra.company.com/it/don/spotlight_winners.html
file not found: http://business.intra.company.com/ask/index.html
file not found: http://business.intra.company.com/meetings/meeting_schedule.htm
file not found: http://business.intra.company.com/associates/associate_info.htm
file not found: http://business.intra.company.com/associates/index.html
got http://business.intra.company.com/it/associates/associate_info.htm
file not found: http://business.intra.company.com/../index.html
got http://business.intra.company.com/index.html
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
...

You get the picture. Can any sharp-eyed Monks see what's obviously wrong with this code? (I'm thinking of using a hash to prevent duplicate visits, but how?...)

Replies are listed 'Best First'.
Re: Problems with cruft-finding script
by PodMaster (Abbot) on Dec 10, 2003 at 17:24 UTC
    You need to read the docs a little closer, especially WHAT'S A LINK-type tag

    As a general strategy, when in doubt, Dumper :)

    use HTML::LinkExtractor;
    use Data::Dumper;

    my $input = q{
    <blockquote cite="http://crazyinsomniac.perlmonk.org/index2.html">
    Now that's some goood feedass (ass back it's all good) %')
    </blockquote>
    If
    <a href="http://perl.com/"> I am a LINK!!! </a>
    <IMG SRC="YODAYODAYODAYODAYODAYODAYODAYODAYOD.png" ALT="It's YODA!">
    };

    my $LX = new HTML::LinkExtractor();
    $LX->parse(\$input);
    print Dumper($LX->links);

    __END__
    $VAR1 = [
              {
                'cite' => 'http://crazyinsomniac.perlmonk.org/index2.html',
                '_TEXT' => '<blockquote cite="http://crazyinsomniac.perlmonk.org/index2.html">
    Now that\'s some goood feedass (ass back it\'s all good) %\')
    </blockquote>',
                'tag' => 'blockquote'
              },
              {
                '_TEXT' => '<a href="http://perl.com/"> I am a LINK!!! </a>',
                'href' => 'http://perl.com/',
                'tag' => 'a'
              },
              {
                'alt' => 'It\'s YODA!',
                'src' => 'YODAYODAYODAYODAYODAYODAYODAYODAYOD.png',
                'tag' => 'img'
              }
            ];


      Ah, from the author himself. :) Yes, you're right - I thought I read the docs carefully, but I missed that. :( So I suppose a good way to get the IMG SRCs is:

      foreach (@{ $LX->links }) {
          if ($$_{tag} eq 'img') {
              # new_abs is a class method on URI (needs "use URI;" up top)
              push @allLinks, URI->new_abs($$_{src}, $base);
          }
          # do other stuff
      }
      Being a bit of a Perl n00bie, I did not think to use Data::Dumper. Thanks for expanding my mind. :)
Re: Problems with cruft-finding script
by Abigail-II (Bishop) on Dec 10, 2003 at 17:24 UTC
    my %seen;

    sub recursiveFollow {
        my $file = shift;
        return if $seen{$file}++;
        ...
    }
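    In context, the guard goes at the top of the sub from the original post, roughly like this sketch (everything after the fetch stays as it was):

    my %seen;          # paths we have already fetched

    sub recursiveFollow {
        my $file = shift;
        return if $seen{$file}++;      # visit each page exactly once
        my $html = get("$base$file");
        # ... the rest of the original sub, unchanged ...
    }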

    Abigail