I posted last week about a familiar web problem: finding content in a web directory that nothing else in that subweb links to. The best solution the Monks came up with is to get the path of each document I come across and print it out using its full Unix path name. Here's my first attempt (with some links changed to protect the name of my innocent company :):

#!/usr/bin/perl

use strict;
use warnings;

use HTML::LinkExtractor;
use LWP::Simple qw(get);

# Base for searching
my $base = "http://business.intra.company.com";

# Create the LinkExtractor object for use in the subroutine
my $LX = HTML::LinkExtractor->new(undef, "$base/it/");

# List of all the links found
my @allLinks;

# Start here on the recursive traversal
recursiveFollow("/it/index.html");

foreach (@allLinks) {
    # Print a list of resources to be used on apollo
    print "/wwwprod/docs/business/docs" . $_->path, "\n";
}

sub recursiveFollow {
    my $file = shift;
    my $html = get("$base$file");

    my @thisDocLinks;

    if (!defined $html) {
        warn "file not found: $base$file\n";
        return;
    }
    # DEBUG
    print "got $base$file\n";
    # /DEBUG

    $LX->parse(\$html);
    for my $link (@{ $LX->links }) {
        next if !defined $$link{href};
        # Stash the link if it's a relative link or it begins with $base
        # but NOT if it's a "file:///" URI
        if (($$link{href} !~ /^http:/ || $$link{href} =~ /business\.intra/)
            && ($$link{href} !~ /^file:/)) {
            push @allLinks, $$link{href};

            if ($$link{href} =~ /\.html?$/) {
                push @thisDocLinks, $$link{href};
            }
        }
    }

    # Follow each link to an htm/html file found in this file recursively
    foreach (@thisDocLinks) {
        recursiveFollow($_->path);
    }
}

There are two problems, one moderate and one major. The moderate problem is that HTML::LinkExtractor::parse() doesn't seem to pick out IMG locations, and discovering IMG links is an important part of my analysis. Is there a hack I can use to search the $html I get for the IMG SRC locations?
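One possible hack, as a rough sketch only: scan the fetched HTML with a regex and collect the SRC value of each IMG tag. The helper name img_sources and the sample markup are made up for illustration, and a regex is only an approximation of real HTML parsing.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the SRC value of every IMG tag in an HTML string.
# Approximation only: this also matches tags inside comments or scripts,
# and attributes such as data-src, and assumes no ">" before the SRC attribute.
sub img_sources {
    my $html = shift;
    my @src;
    while ($html =~ /<img\b[^>]*?\bsrc\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/gi) {
        # Exactly one alternative matched: double-quoted, single-quoted, or bare
        push @src, defined $1 ? $1 : defined $2 ? $2 : $3;
    }
    return @src;
}

# Made-up sample markup for illustration
my $html = <<'HTML';
<p><IMG SRC="images/logo.gif" ALT="logo">
<img alt='x' src='/it/pics/team.jpg'>
<img src=banner.png width=10></p>
HTML

print "$_\n" for img_sources($html);
```

The values come back as plain strings. Since the script above calls ->path on its stored links, they would presumably need resolving against the base first (for instance with URI->new_abs) before being pushed onto @allLinks.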

The major problem is that it gets stuck in an infinite loop, seemingly because it's trying to access the same file over and over. Obviously I want to visit each page exactly once, but I'm getting this trace when running:

$ ./findorphans.pl
got http://business.intra.company.com/it/index.html
got http://business.intra.company.com/it/meetings/meeting_schedule.htm
file not found: http://business.intra.company.com/meetings/meeting_schedule.htm
file not found: http://business.intra.company.com/associates/associate_info.htm
file not found: http://business.intra.company.com/associates/index.html
got http://business.intra.company.com/it/don/spotlight_winners.html
file not found: http://business.intra.company.com/ask/index.html
file not found: http://business.intra.company.com/meetings/meeting_schedule.htm
file not found: http://business.intra.company.com/associates/associate_info.htm
file not found: http://business.intra.company.com/associates/index.html
got http://business.intra.company.com/it/associates/associate_info.htm
file not found: http://business.intra.company.com/../index.html
got http://business.intra.company.com/index.html
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm

...

You get the picture. Can any sharp-eyed Monks see what's obviously wrong with this code? (I'm thinking use a hash to prevent duplications, but how?...)
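The hash idea might look something like the following minimal sketch, not a drop-in fix: keep a %seen hash keyed by path and return early when a path has already been visited. To stay self-contained, the sketch replaces get() and HTML::LinkExtractor with a hypothetical in-memory %site hash mapping each path to the paths it links to. The paths in it and the name follow_once are assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the intranet: path => paths it links to.
# Note the cycle between /index.html and /aboutbusiness/bus_history.htm.
my %site = (
    '/it/index.html'                 => [ '/index.html', '/it/don/spotlight_winners.html' ],
    '/index.html'                    => [ '/aboutbusiness/bus_history.htm' ],
    '/aboutbusiness/bus_history.htm' => [ '/index.html', '/aboutbusiness/bus_history.htm' ],
    '/it/don/spotlight_winners.html' => [ '/it/index.html' ],
);

my %seen;       # paths already visited
my @visited;    # visit order, kept only for reporting

sub follow_once {
    my $path = shift;

    # Post-increment returns the old count, so this is true only on repeat visits
    return if $seen{$path}++;
    push @visited, $path;

    # A path missing from %site plays the role of "file not found"
    my $links = $site{$path} or return;
    follow_once($_) for @$links;
}

follow_once('/it/index.html');
print "$_\n" for @visited;
```

Each path is printed once even though the pages link back to each other. In the real script the same check would go at the top of recursiveFollow, keyed on whatever string identifies a page (the path, say), so that a page is fetched at most once.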

In reply to Problems with cruft-finding script by aarestad
