
This is a common problem, and one to which there is no definite answer, only various suggested approaches. It is also the bane of my life, so I have a degree of sympathy for you. I have spent a lot of time indexing content from the web using a commercial search engine.

Once upon a time the URL of a particular document could be treated as a unique(ish) identifier. Problems arise where you have documents that:

contain the same content but have different URLs (e.g. this page will appear at the .org and .com perlmonks sites).
are generated by a content management system that places some form of additional information into the URL, e.g. a session identifier.
the author decided should tell you the time ('cos none of us have watches!) and therefore change content slightly every time that you load them.

As I mentioned earlier, there is no easy answer to this question. The best that you can do is look for evidence that the documents returned are the same, in an effort to detect loops during indexing. I would look for the following:

The static part of the URL - Typically a document management / session management system will create a URL with a static part that allows it to identify the document and a dynamic part for session tracking. If you can identify the static part and use regexes to remove the dynamic part, then you can build and track a list of the pages you have already visited.
Look for an alternate piece of evidence, such as the title of the document or an internal ID generated as a Meta tag.
Use a CRC or a similar technique to discover documents that have the same content. This technique can be extended to discovering documents that are similar but have a tiny difference (e.g. just having a helpful 'the time is...' section).
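
To make the first and third suggestions concrete, here is a rough sketch in Python of how they might fit together. It is only an illustration under assumptions of mine: the session parameter names in SESSION_PARAMS, the ;jsessionid= path pattern, the SeenPages class and the four-word shingle size are all hypothetical and would need tuning for the sites you actually crawl. The similarity function uses word shingles with a Jaccard score, which is one possible way of spotting "nearly the same" pages and not the only one.

```python
import re
import zlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed names of query parameters that carry session state (hypothetical).
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}

# Assumed path-style session suffix, e.g. "/doc.jsp;jsessionid=ABC123".
PATH_SESSION_RE = re.compile(r";jsessionid=[^/?#]*", re.IGNORECASE)


def canonical_url(url):
    """Return the URL with the assumed dynamic (session) parts removed."""
    parts = urlsplit(url)
    path = PATH_SESSION_RE.sub("", parts.path)
    query = [(k, v)
             for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k.lower() not in SESSION_PARAMS]
    query.sort()  # parameter order should not create a "new" page
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, urlencode(query), ""))


def content_checksum(text):
    """CRC-32 of the page text after collapsing runs of whitespace."""
    normalised = " ".join(text.split())
    return zlib.crc32(normalised.encode("utf-8"))


def shingles(text, size=4):
    """Set of overlapping word sequences of the given length."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(a, b, size=4):
    """Jaccard similarity of the two texts' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a, size), shingles(b, size)
    return len(sa & sb) / len(sa | sb)


class SeenPages:
    """Remembers canonical URLs and content checksums already indexed."""

    def __init__(self):
        self.urls = set()
        self.checksums = set()

    def is_new(self, url, text):
        """Return True and record the page unless it was seen before."""
        key = canonical_url(url)
        crc = content_checksum(text)
        if key in self.urls or crc in self.checksums:
            return False
        self.urls.add(key)
        self.checksums.add(crc)
        return True
```

The spider would call is_new() before queueing a page's links and skip any page for which it returns False. For pages that differ only by a clock or similar, compare similarity() against a threshold of your own choosing; where to set it is a judgment call, and a CRC collision between genuinely different pages is possible, if unlikely.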

Of course the most important technique would be to use an existing Spider tool which has all of this built in! The following list of resources culled from my favourites may be of interest:

Search Tools - Contains a number of articles about search engines, indexing tools etc.
Search Engine Watch - More of the same.
Bot Spot - Information on Bots
MOM Spider - An open source spider written in (no prizes for guessing!) Perl

Good Luck

inman

In reply to Re: Infinite loop prevention for spider by inman
in thread Infinite loop prevention for spider by Wassercrats
