Re: Infinite loop prevention for spider
by Abigail-II (Bishop) on Nov 09, 2003 at 14:38 UTC
This is impossible to determine from the client side.
Suppose you are playing a text adventure, and you find
yourself in a maze. All rooms have the same description.
Just based on the description, you do not know whether
you have been there before or not. And even if you remember
all the pages, and say "if two pages have the same content,
I consider them to be the same, even if the URLs differ",
you can have a problem - for instance, the page may contain a
'counter' or a timestamp, so that the content is different
each time.
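(If you do go the "same content" route anyway, here is a crude sketch of the idea; the normalization, collapsing whitespace and dropping digits, is only an illustration of how you might blunt counters and timestamps, not a real solution:)

use Digest::MD5 qw(md5_hex);

my %seen_fingerprint;    # fingerprint => URL where we first saw this content

sub looks_like_duplicate {
    my ($url, $content) = @_;
    (my $normalized = $content) =~ s/\s+/ /g;   # collapse whitespace
    $normalized =~ s/\d+//g;                    # drop counters and timestamps
    my $fp = md5_hex($normalized);
    return $seen_fingerprint{$fp} if exists $seen_fingerprint{$fp};
    $seen_fingerprint{$fp} = $url;
    return;                                     # not seen before
}

Anything for which looks_like_duplicate() returns a URL is a page you have (probably) already indexed under another address.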
You might be able to come up with some heuristics, but then
you will have to accept that you will have false positives
and false negatives. And make sure you check a site's
robots.txt - that should prevent a spider from getting into
a loop.
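For the robots.txt part, LWP already ships a robot-aware user agent that fetches and honours robots.txt for you; roughly like this (the agent name and contact address are placeholders):

use LWP::RobotUA;

my $ua = LWP::RobotUA->new(
    agent => 'MySpider/0.1',        # placeholder agent name
    from  => 'me@example.com',      # contact address, required by the module
);
$ua->delay(1);                      # at most one request per minute per host

my $response = $ua->get('http://www.example.com/');
# URLs disallowed by the site's robots.txt come back as 403
# "Forbidden by robots.txt", so the spider never actually fetches them.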
Of course, your question has nothing to do with Perl. You'd
have to solve the same problems if you used any other
language.
Abigail
The solution, then, is to start spidering with a large inventory of items (e.g., a shovel, perhaps some miscellaneous treasure). As you spider each page, drop one of your inventory items into that page. Then when you visit a page again, you can tell which one it is by which inventory item is there.
Oh, and make sure your spider has a lantern, or else it is likely to be eaten by a grue...
•Re: Infinite loop prevention for spider
by merlyn (Sage) on Nov 09, 2003 at 16:04 UTC
As an experiment, for a while I had a link at http://www.stonehenge.com that consisted solely of -/, and I put a symlink in the web directory linking "-" to ".". That means that you could address any page on my site with an arbitrary number of "/-" throwaways, such as "/-/-/-/-/merlyn/columns.html".
I did this to see what kind of similar-duplicate rejection algorithms the big indexing spiders use. Most of them recognized rather quickly that the pages were duplicates, but NorthernLights had indexed about 20 levels deep of the same pages before I turned the link off. Bleh!
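One defence on the spider's side is to canonicalize every URL before queueing it and to refuse paths that repeat the same segment over and over; a rough sketch, where the repeat limit of 3 is arbitrary:

use URI;

sub sane_url {
    my ($url) = @_;
    my $uri = URI->new($url)->canonical;    # normalize case, port, escaping
    my %count;
    $count{$_}++ for grep { length } $uri->path_segments;
    # Refuse URLs such as /-/-/-/-/merlyn/columns.html where one
    # segment repeats suspiciously often.
    return if grep { $_ > 3 } values %count;
    return $uri->as_string;
}

It will still index a couple of levels of the duplicate pages, but it cannot recurse forever.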
Re: Infinite loop prevention for spider
by inman (Curate) on Nov 10, 2003 at 09:17 UTC
This is a common problem, and one to which there is no definite answer, only various suggested approaches. It is also the bane of my life, so I have a degree of sympathy for you. I have spent a lot of time indexing content from the web using a commercial search engine.
Once upon a time the URL of a particular document could be treated as a unique(ish) identifier. Problems arise where you have documents that:
- contain the same content but have different URLs (e.g. this page will appear at the .org and .com perlmonks sites).
- are generated by a content management system that places some form of additional information into the URL, e.g. a session identifier.
- the author decided should tell you the time ('cos none of us have watches!) and therefore change content slightly every time that you load them.
As I mentioned earlier, there is no easy answer to this question; the best you can do is look for evidence that the documents returned are the same, in an effort to detect loops during indexing. I would look for the following:
- The static part of the URL - Typically a document management / session management system will create a URL with a static part that allows it to identify the document and a dynamic part for session tracking. If you can identify the static part and use regexes to remove the dynamic part, then you can create and track a list of pages (see the sketch after this list).
- Look for an alternate piece of evidence, such as the title of the document or an internal ID generated as a Meta tag.
- Use CRC or a similar technique to discover documents that have the same content. This technique can be extended to discovering documents that are similar but have a tiny difference (e.g. just having a helpful 'the time is...' section).
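Here is a rough sketch of the first and third points combined; the session parameter names (sid, sessionid, PHPSESSID) are only examples, and you would have to adapt them to whatever the content management system actually emits:

use URI;
use Digest::MD5 qw(md5_hex);

my %seen_url;       # canonical URL  => 1
my %seen_digest;    # content digest => canonical URL

sub canonical_url {
    my ($url) = @_;
    my $uri   = URI->new($url)->canonical;
    my @pairs = $uri->query_form;
    my @kept;
    while (my ($k, $v) = splice @pairs, 0, 2) {
        # The parameter names are only examples of "dynamic parts".
        push @kept, $k, $v unless $k =~ /^(?:sid|sessionid|PHPSESSID)$/i;
    }
    @kept ? $uri->query_form(@kept) : $uri->query(undef);
    return $uri->as_string;
}

sub already_indexed {
    my ($url, $content) = @_;
    my $canon  = canonical_url($url);
    my $digest = md5_hex($content);     # or a CRC, as suggested above
    return 1 if $seen_url{$canon}++ or exists $seen_digest{$digest};
    $seen_digest{$digest} = $canon;
    return 0;
}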
Of course, the most important technique would be to use an existing spider tool which has all of this built in! The following list of resources, culled from my favourites, may be of interest:
Good Luck
inman
Re: Infinite loop prevention for spider
by Corion (Patriarch) on Nov 10, 2003 at 09:25 UTC
I saw a talk by the author of String::Trigram, and he mentioned that he used his module for a similar problem: determining whether a webpage had changed or not. If you tune your similarity threshold well enough, this could be another measure for "page similarity", or rather "these two URLs are the same page".
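Assuming the module's compare() function (I believe it returns a similarity score between 0 and 1, but check the docs), the check might look roughly like this; the 0.9 threshold is just a guess you would have to tune:

use String::Trigram;

# Two fetches of (what may be) the same page.
my $old = "Just another Perl hacker, fetched at 10:15";
my $new = "Just another Perl hacker, fetched at 10:16";

# Assumes String::Trigram::compare() returns a value between 0 and 1.
my $similarity = String::Trigram::compare($old, $new);
print "Probably the same page\n" if $similarity > 0.9;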
perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The
$d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider
($c = $d->accept())->get_request(); $c->send_response( new #in the
HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web