agent00013 has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to write a program that verifies HTML links, then spiders through the site and checks the links on each page it turns up. I'm using HTML::LinkExtor and LWP::Simple to verify the links. My problem is implementing a spider that doesn't end up in an infinite loop. Any ideas on how to keep track of visited pages and avoid returning to them?

(Ovid) Re: Verifying Links in HTML
by Ovid (Cardinal) on Jun 13, 2001 at 23:51 UTC

    I would use a hash to store each visited link and use the exists function before visiting a link. The main issues would be:

    • You probably want to ensure that you have absolute links (not relative ones) in the hash.
    • Ensure that any links that point to scripts have the query string stripped. Rough guess as to the best way to do that:
      $link =~ s/\?.*$//;
      $visited{ $link } = '';

    Without knowing how you're getting into the infinite loop, I can only offer one other suggestion: for the %visited hash, you may have a problem if a redirection occurs as you'll resolve to the wrong URL. If that's likely to be an issue, you'll have to use LWP::UserAgent and examine the response code.
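
    Something along these lines might serve as a rough, untested sketch of what I mean (the start URL is just a placeholder, and it only follows plain <a href> links):

      use strict;
      use LWP::Simple qw(get);
      use HTML::LinkExtor;
      use URI;

      my %visited;
      my @queue = ('http://www.example.com/');    # starting page (placeholder)

      while ( my $url = shift @queue ) {
          next if exists $visited{$url};
          $visited{$url} = 1;

          my $html = get($url);
          next unless defined $html;              # fetch failed; report the broken link here

          # Pass the page URL as a base so relative links come back absolute
          my $parser = HTML::LinkExtor->new( undef, $url );
          $parser->parse($html);

          for my $link ( $parser->links ) {
              my ( $tag, %attr ) = @$link;
              next unless $tag eq 'a' and $attr{href};
              my $abs = URI->new( $attr{href} )->canonical;
              $abs->query(undef);                 # strip the query string
              $abs->fragment(undef);              # drop #anchors so the same page isn't queued twice
              push @queue, "$abs" unless exists $visited{"$abs"};
          }
      }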

    Cheers,
    Ovid

    Join the Perlmonks Setiathome Group or just click on the link and check out our stats.

Re: Verifying Links in HTML
by Masem (Monsignor) on Jun 13, 2001 at 23:49 UTC
    Use a hash. If the link already exists in the hash, you can skip it. Otherwise process the link and add it to the hash.
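
    Inside your crawl loop, that's just something like this (a minimal sketch; %visited and process_link are placeholder names for whatever you're already using):

      next if $visited{$url}++;    # seen before: skip; otherwise mark it and fall through
      process_link($url);          # hypothetical routine that does the actual checking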


    Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain
Re: Verifying Links in HTML
by perigeeV (Hermit) on Jun 14, 2001 at 02:19 UTC

    For what it's worth, linkcheck is available on CPAN. I love CPAN. CPAN makes me happy.

Re: Verifying Links in HTML
by Hero Zzyzzx (Curate) on Jun 13, 2001 at 23:50 UTC

    A general question gets a general answer...
    Are you going to be checking static or dynamic pages? This "infinite loop" thing is probably more of an issue for dynamic pages: there are often multiple ways of looking at the same page (normal, printable, lo-bandwidth, whatever), so you'd need to take that into account.

    If you're spidering static pages, what about dumping the address of each page into a DB and then checking each new page against the DB? If the page is already there, skip it.

    If they are dynamic pages, and each document has a unique id of some sort, you could parse out the ID and store that in the DB. Then the multiple views thing would be moot, because the ID is what you're matching on.

    Update: A hash would probably be better speed-wise, but with a DB you could also store the invalid links and easily report on them later.
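
    Combining those two points, something like this might do (untested; it assumes the document id arrives in a query parameter literally named "id", which is just a guess at your setup):

      use URI;

      my %seen_id;

      sub already_checked {
          my ($url)  = @_;
          my %params = URI->new($url)->query_form;
          my $key    = defined $params{id} ? $params{id} : $url;   # fall back to the raw URL
          return $seen_id{$key}++;    # false the first time, true on every repeat view
      }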

Re: Verifying Links in HTML
by agent00013 (Pilgrim) on Jun 14, 2001 at 00:07 UTC
    The pages I'm checking are static. I was attempting to use an array to keep track of visited links, but I was having problems testing whether a page was already in there; something was wrong with my regular expressions. I think the hash will work better, though. Thanks for your suggestions.

    For a string comprised of "http://www.whatever.html/directory/", how would I write a regular expression that'd make this equivalent to "http://www.whatever.html/directory/index.html"? Essentially I'd like to do this so that it doesn't test the same file twice under different names.

      $string = 'http://www.whatever.html/directory/';
      $string .= 'index.html' if substr($string, -1) eq '/';

      bbfu
      Seasons don't fear The Reaper.
      Nor do the wind, the sun, and the rain.
      We can be like they are.