in reply to Infinite loop prevention for spider

This is impossible to determine reliably from the client side. Suppose you are playing a text adventure, and you find yourself in a maze. All rooms have the same description. Based on the description alone, you do not know whether you have been there before. And even if you remember all the pages, and say "if two pages have the same content, I consider them to be the same, even if the URLs differ", you can still have a problem - for instance, the page may contain a counter or a timestamp, so its content is different on each fetch.
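If you want to try the "same content means the same page" heuristic anyway, one way to blunt the counter/timestamp problem is to strip the volatile parts before fingerprinting. A minimal sketch in Perl - the patterns below are assumptions about what counts as volatile, and the approach will still yield both false positives and false negatives:

    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);

    my %seen_fingerprint;

    # Heuristic: remove things that change on every fetch (comments,
    # counters, timestamps), then hash what is left. The patterns are
    # illustrative, not a complete list.
    sub fingerprint {
        my ($html) = @_;
        $html =~ s/<!--.*?-->//gs;    # comments often carry cache info
        $html =~ s/\d+//g;            # counters, dates, times
        $html =~ s/\s+/ /g;           # normalize whitespace
        return md5_hex($html);
    }

    sub probably_seen {
        my ($html) = @_;
        return $seen_fingerprint{ fingerprint($html) }++;
    }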

You might be able to come up with some heuristics, but then you will have to accept that you will have false positives and false negatives. And make sure you check a site's robots.txt - well-run sites use it to fence off exactly the kind of areas that send a spider in circles.
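For the robots.txt part, Perl has this covered already: LWP::RobotUA is a drop-in replacement for LWP::UserAgent that fetches and honours each site's robots.txt for you. A small sketch (the agent name and address are placeholders):

    use strict;
    use warnings;
    use LWP::RobotUA;

    # LWP::RobotUA checks robots.txt before each request; fetches of
    # disallowed URLs fail without bothering the remote server.
    my $ua = LWP::RobotUA->new('my-spider/0.1', 'me@example.com');
    $ua->delay(1);    # minimum delay between requests, in minutes

    my $res = $ua->get('http://www.example.com/');
    print $res->is_success ? $res->content : $res->status_line;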

Of course, your question has nothing to do with Perl. You'd have to solve the same problems if you'd used any other language.

Abigail


Re: Re: Infinite loop prevention for spider
by sgifford (Prior) on Nov 10, 2003 at 04:58 UTC

    The solution, then, is to start spidering with a large inventory of items (e.g., a shovel, perhaps some miscellaneous treasure). As you spider each page, drop one of your inventory items into that page. Then when you visit a page again, you can tell which one it is by which inventory item is there.

    Oh, and make sure your spider has a lantern, or else it is likely to be eaten by a grue...
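
    The web equivalent, more or less: you cannot drop items into other people's pages, so you keep the markers on your own side, in a %seen hash keyed by a canonicalized URL. A sketch - the canonicalization rules here are assumptions:

        use strict;
        use warnings;
        use URI;

        my %seen;    # our "inventory markers", one per visited room

        # Canonicalize so trivially different URLs count as the same room.
        sub canonical {
            my ($url) = @_;
            my $uri = URI->new($url)->canonical;
            $uri->fragment(undef);    # '#section' is still the same page
            return $uri->as_string;
        }

        sub visit {
            my ($url) = @_;
            return if $seen{ canonical($url) }++;    # marker already here
            # ... fetch and parse the page ...
        }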

Re: Re: Infinite loop prevention for spider
by Wassercrats (Initiate) on Nov 09, 2003 at 15:12 UTC
    Yes, I thought of the possible non-link time stamp issue. My current bot strips all the URLs before making its various comparisons, but that might not be enough. I wonder what the typical way of dealing with this is.
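
    If "stripping all the URLs" means something like the following, note that quoted attributes and bare URLs in the text need separate patterns; this regex sketch (my guess at what the bot does) only catches the obvious cases, and a real parser such as HTML::Parser would be safer:

        # Remove link targets before comparing page bodies, so pages
        # differing only in session-stamped URLs compare as equal.
        sub strip_urls {
            my ($html) = @_;
            $html =~ s/\b(?:href|src)\s*=\s*("[^"]*"|'[^']*'|\S+)//gi;
            $html =~ s{\bhttps?://\S+}{}gi;    # bare URLs in the text
            return $html;
        }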

    There is a new O'Reilly book out called Spidering Hacks. I hope I can find it in a book store near me (I'm not certain enough that it would be helpful to shell out the money sight unseen). And I hope people put the proper entries in their robots.txt files!

    Thanks