I posted last week about trying to find a solution to a familiar web problem: finding content in a web directory that is not linked to from anything in that subweb. The best solution that the Monks came up with is to collect the path of each document I come across and print it out using its full Unix path name. Here's my first attempt (with some links changed to protect the name of my innocent company :):

#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtractor;
use LWP::Simple qw(get);

# Base for searching
my $base = "http://business.intra.company.com";

# Create the LinkExtractor object for use in the subroutine
my $LX = new HTML::LinkExtractor(undef, "$base/it/");

# List of all the links found
my @allLinks;

# Start here on the recursive traversal
recursiveFollow("/it/index.html");

foreach (@allLinks) {
    # Print a list of resources to be used on apollo
    print "/wwwprod/docs/business/docs" . $_->path, "\n";
}

sub recursiveFollow {
    my $file = shift;
    my $html = get("$base$file");
    my @thisDocLinks;

    if (!defined $html) {
        warn "file not found: $base$file\n";
        return;
    }

    # DEBUG
    print "got $base$file\n";
    # /DEBUG

    $LX->parse(\$html);

    for my $link (@{ $LX->links }) {
        next if !defined $$link{href};

        # Stash the link if it's a relative link or it begins with $base
        # but NOT if it's a "file:///" URI
        if (($$link{href} !~ /^http:/ || $$link{href} =~ /business\.intra/)
            && ($$link{href} !~ /^file:/)) {
            push @allLinks, $$link{href};
            if ($$link{href} =~ /\.html?$/) {
                push @thisDocLinks, $$link{href};
            }
        }
    }

    # Follow each link to an htm/html file found in this file recursively
    foreach (@thisDocLinks) {
        recursiveFollow($_->path);
    }
}

There are two problems, one moderate and one major. The moderate problem is that HTML::LinkExtractor::parse() doesn't seem to pick out IMG locations, and discovering IMG links is an important part of my analysis. Is there a hack I can use to search the $html I fetch for the IMG SRC locations?
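To make the kind of hack I mean concrete, here is a rough sketch that pulls SRC values out of IMG tags with a plain regex. The sub name img_sources and the sample HTML are made up for illustration, and a regex is only a stopgap: it will be fooled by comments, by odd quoting, and by attributes whose names merely end in "src". A real HTML parser would be the safer route.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): return the SRC value of
# every IMG tag found in a string of HTML. Handles double-quoted,
# single-quoted, and unquoted attribute values.
sub img_sources {
    my ($html) = @_;
    my @src;
    while ($html =~ /<img\b[^>]*?\bsrc\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/gi) {
        # Exactly one of the three capture groups is defined per match
        push @src, grep { defined } $1, $2, $3;
    }
    return @src;
}

# Made-up sample page for illustration
my $sample = q{<p><IMG SRC="images/logo.gif" ALT="logo"><img alt='x' src='pics/a.png'><img border=0 src=b.jpg></p>};

my @images = img_sources($sample);
print "$_\n" for @images;
```

In the script above, the idea would be to call something like this on $html right after the get() and push the results onto @allLinks, resolving relative paths against the current page first.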

The major problem is that it gets stuck in an infinite loop, seemingly because it keeps accessing the same file over and over. Obviously I want to visit each page exactly once, but I'm getting this trace when I run it:

$ ./findorphans.pl
got http://business.intra.company.com/it/index.html
got http://business.intra.company.com/it/meetings/meeting_schedule.htm
file not found: http://business.intra.company.com/meetings/meeting_schedule.htm
file not found: http://business.intra.company.com/associates/associate_info.htm
file not found: http://business.intra.company.com/associates/index.html
got http://business.intra.company.com/it/don/spotlight_winners.html
file not found: http://business.intra.company.com/ask/index.html
file not found: http://business.intra.company.com/meetings/meeting_schedule.htm
file not found: http://business.intra.company.com/associates/associate_info.htm
file not found: http://business.intra.company.com/associates/index.html
got http://business.intra.company.com/it/associates/associate_info.htm
file not found: http://business.intra.company.com/../index.html
got http://business.intra.company.com/index.html
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
got http://business.intra.company.com/aboutbusiness/bus_history.htm
...

You get the picture. Can any sharp-eyed Monks see what's obviously wrong with this code? (I'm thinking use a hash to prevent duplications, but how?...)
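For what it's worth, here is roughly how I imagine the hash idea working, as a minimal sketch rather than a tested fix. To keep it runnable without the intranet, it walks a made-up in-memory link table (%site) instead of fetching pages; %seen, @visited, and recursive_follow are likewise names I invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the site: page path => paths it links to.
# Note the self-link and the cycles, which would loop forever unguarded.
my %site = (
    '/it/index.html'                 => ['/index.html', '/it/don/spotlight_winners.html'],
    '/index.html'                    => ['/aboutbusiness/bus_history.htm'],
    '/aboutbusiness/bus_history.htm' => ['/aboutbusiness/bus_history.htm', '/index.html'],
    '/it/don/spotlight_winners.html' => ['/it/index.html'],
);

my %seen;       # paths already visited
my @visited;    # the order in which pages were first visited

sub recursive_follow {
    my ($path) = @_;

    # The post-increment returns the old count: 0 (false) on the first
    # visit, so we carry on; non-zero on any later visit, so we bail out.
    return if $seen{$path}++;

    push @visited, $path;
    recursive_follow($_) for @{ $site{$path} || [] };
    return;
}

recursive_follow('/it/index.html');
print "$_\n" for @visited;
```

In the real script the key would presumably need to be the normalized path of each fetched page, so that paths like /../index.html and /index.html count as the same page, and the check would go at the top of recursiveFollow before the get().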


In reply to Problems with cruft-finding script by aarestad
