in reply to Infinite loop prevention for spider

This is a common problem, and one to which there is no definite answer, only various suggested approaches. It is also the bane of my life, so I have a degree of sympathy for you. I have spent a lot of time indexing content from the web using a commercial search engine.

Once upon a time the URL of a particular document could be treated as a unique(ish) identifier. Problems arise where you have documents that:

  1. contain the same content but have different URLs (e.g. this page will appear at the .org and .com perlmonks sites).
  2. are generated by a content management system that places some form of additional information into the URL, e.g. a session identifier.
  3. the author decided should tell you the time ('cos none of us have watches!) and therefore change content slightly every time that you load them.

As I mentioned earlier, there is no easy answer to this question. The best that you can do is look for evidence that the documents returned are the same, in an effort to detect loops during indexing. I would look for the following:

  1. The static part of the URL - Typically a document management / session management system will create a URL with a static part that allows it to identify the document and a dynamic part for session tracking. If you can identify the static part and use regexes to remove the dynamic part, then you can build and track a list of pages that you have already visited.
  2. Look for an alternative piece of evidence, such as the title of the document or an internal ID generated as a meta tag.
  3. Use a CRC or a similar technique to discover documents that have the same content. This technique can be extended to discovering documents that are similar but have a tiny difference (e.g. just having a helpful 'the time is...' section).
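As a rough illustration of points 1 and 3, here is a minimal Python sketch. It is not taken from any particular spider. The parameter names in SESSION_PARAMS, the two regexes, and the helper names (canonical_url, content_checksum, SeenTracker) are all assumptions made up for this example, and you would need to tune them to the sites you are indexing. Stripping clock times before checksumming is only one crude way of coping with the 'the time is...' problem.

```python
import re
import zlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical query parameters that carry session state; adjust per site.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}

# Path-style session IDs such as ";jsessionid=ABC123" (an assumed pattern).
PATH_SESSION_RE = re.compile(r";jsessionid=[^/?#]*", re.IGNORECASE)

# Times of day such as "12:34" or "12:34:56", removed before checksumming.
CLOCK_RE = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b")


def canonical_url(url):
    """Return the URL with the dynamic (session) parts stripped out."""
    parts = urlsplit(url)
    path = PATH_SESSION_RE.sub("", parts.path)
    # Keep only the query parameters that are not session identifiers.
    query = [(k, v)
             for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k.lower() not in SESSION_PARAMS]
    query.sort()  # so parameter order does not create "new" URLs
    # Drop the fragment; it never reaches the server.
    return urlunsplit((parts.scheme, parts.netloc.lower(), path,
                       urlencode(query), ""))


def content_checksum(text):
    """CRC32 of the text after removing clock times and extra whitespace."""
    normalised = CLOCK_RE.sub("", text)
    normalised = " ".join(normalised.split()).lower()
    return zlib.crc32(normalised.encode("utf-8"))


class SeenTracker:
    """Remembers canonical URLs and content checksums already indexed."""

    def __init__(self):
        self.urls = set()
        self.checksums = set()

    def url_is_new(self, url):
        key = canonical_url(url)
        if key in self.urls:
            return False
        self.urls.add(key)
        return True

    def content_is_new(self, text):
        key = content_checksum(text)
        if key in self.checksums:
            return False
        self.checksums.add(key)
        return True
```

The spider would call url_is_new() before fetching a link and content_is_new() after fetching it, skipping the page (and its outgoing links) when either returns false. Bear in mind that CRC32 is only 32 bits wide, so accidental collisions become likely on a large crawl; a longer digest such as hashlib.sha256 is a drop-in alternative if that matters.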

Of course the most important technique would be to use an existing Spider tool which has all of this built in! The following list of resources culled from my favourites may be of interest:

Good Luck

inman
