Re: Infinite loop prevention for spider
by Abigail-II (Bishop) on Nov 09, 2003 at 14:38 UTC
This is impossible to determine from the client side.
Suppose you are playing a text adventure, and you find
yourself in a maze. All rooms have the same description.
Just based on the description, you do not know whether
you have been there before or not. And even if you remember
all the pages, and say "if two pages have the same content,
I consider them to be the same, even if the URLs differ",
you can have a problem - for instance, the page may contain a
'counter' or a timestamp, so that the content is different
each time.
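(If you do go the "same content" route anyway, here is a crude sketch of the idea; the normalization, collapsing whitespace and dropping digits, is only an illustration of how you might blunt counters and timestamps, not a real solution:)

use Digest::MD5 qw(md5_hex);

my %seen_fingerprint;    # fingerprint => URL where we first saw this content

sub looks_like_duplicate {
    my ($url, $content) = @_;
    (my $normalized = $content) =~ s/\s+/ /g;   # collapse whitespace
    $normalized =~ s/\d+//g;                    # drop counters and timestamps
    my $fp = md5_hex($normalized);
    return $seen_fingerprint{$fp} if exists $seen_fingerprint{$fp};
    $seen_fingerprint{$fp} = $url;
    return;                                     # not seen before
}

Anything for which looks_like_duplicate() returns a URL is a page you have (probably) already indexed under another address.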
You might be able to come up with some heuristics, but then
you will have to accept that you will have false positives
and false negatives. And make sure you check a site's
robots.txt - that should prevent a spider from getting into
a loop.
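For the robots.txt part, LWP already ships a robot-aware user agent that fetches and honours robots.txt for you; roughly like this (the agent name and contact address are placeholders):

use LWP::RobotUA;

my $ua = LWP::RobotUA->new(
    agent => 'MySpider/0.1',        # placeholder agent name
    from  => 'me@example.com',      # contact address, required by the module
);
$ua->delay(1);                      # at most one request per minute per host

my $response = $ua->get('http://www.example.com/');
# URLs disallowed by the site's robots.txt come back as 403
# "Forbidden by robots.txt", so the spider never actually fetches them.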
Of course, your question has nothing to do with Perl. You'd
have to solve the same problems if you used any other
language.
Abigail
The solution, then, is to start spidering with a large inventory of items (e.g., a shovel, perhaps some miscellaneous treasure). As you spider each page, drop one of your inventory items into that page. Then when you visit a page again, you can tell which one it is by which inventory item is there.
Oh, and make sure your spider has a lantern, or else it is likely to be eaten by a grue...
•Re: Infinite loop prevention for spider
by merlyn (Sage) on Nov 09, 2003 at 16:04 UTC
As an experiment, for a while I had a link at http://www.stonehenge.com that consisted solely of -/, and I put a symlink in the web directory linking "-" to ".". That means that you could address any page on my site with an arbitrary number of "/-" throwaways, such as "/-/-/-/-/merlyn/columns.html".
I did this to see what kind of similar-duplicate rejection algorithms the big indexing spiders use. Most of them recognized rather quickly that the pages were duplicates, but NorthernLights had indexed about 20 levels deep of the same pages before I turned the link off. Bleh!
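One defence on the spider's side is to canonicalize every URL before queueing it and to refuse paths that repeat the same segment over and over; a rough sketch, where the repeat limit of 3 is arbitrary:

use URI;

sub sane_url {
    my ($url) = @_;
    my $uri = URI->new($url)->canonical;    # normalize case, port, escaping
    my %count;
    $count{$_}++ for grep { length } $uri->path_segments;
    # Refuse URLs such as /-/-/-/-/merlyn/columns.html where one
    # segment repeats suspiciously often.
    return if grep { $_ > 3 } values %count;
    return $uri->as_string;
}

It will still index a couple of levels of the duplicate pages, but it cannot recurse forever.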
Re: Infinite loop prevention for spider
by inman (Curate) on Nov 10, 2003 at 09:17 UTC
This is a common problem, and one to which there is no definite answer, only various suggested approaches. It is also the bane of my life, so I have a degree of sympathy for you. I have spent a lot of time indexing content from the web using a commercial search engine.
Once upon a time the URL of a particular document could be treated as a unique(ish) identifier. Problems arise where you have documents that:
- contain the same content but have different URLs (e.g. this page will appear at the .org and .com perlmonks sites).
- are generated by a content management system that places some form of additional information into the URL, e.g. a session identifier.
- the author decided should tell you the time ('cos none of us have watches!) and therefore change content slightly every time that you load them.
As I mentioned earlier, there is no easy answer to this question; the best you can do is look for evidence that the documents returned are the same, in an effort to detect loops during indexing. I would look for the following:
- The static part of the URL - Typically a document management / session management system will create a URL with a static part that allows it to identify the document and a dynamic part for session tracking. If you can identify the static part and use regexes to remove the dynamic part, then you can create and track a list of pages (see the sketch after this list).
- Look for an alternate piece of evidence, such as the title of the document or an internal ID generated as a Meta tag.
- Use CRC or a similar technique to discover documents that have the same content. This technique can be extended to discovering documents that are similar but have a tiny difference (e.g. just having a helpful 'the time is...' section).
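Here is a rough sketch of the first and third points combined; the session parameter names (sid, sessionid, PHPSESSID) are only examples, and you would have to adapt them to whatever the content management system actually emits:

use URI;
use Digest::MD5 qw(md5_hex);

my %seen_url;       # canonical URL  => 1
my %seen_digest;    # content digest => canonical URL

sub canonical_url {
    my ($url) = @_;
    my $uri   = URI->new($url)->canonical;
    my @pairs = $uri->query_form;
    my @kept;
    while (my ($k, $v) = splice @pairs, 0, 2) {
        # The parameter names are only examples of "dynamic parts".
        push @kept, $k, $v unless $k =~ /^(?:sid|sessionid|PHPSESSID)$/i;
    }
    @kept ? $uri->query_form(@kept) : $uri->query(undef);
    return $uri->as_string;
}

sub already_indexed {
    my ($url, $content) = @_;
    my $canon  = canonical_url($url);
    my $digest = md5_hex($content);     # or a CRC, as suggested above
    return 1 if $seen_url{$canon}++ or exists $seen_digest{$digest};
    $seen_digest{$digest} = $canon;
    return 0;
}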
Of course, the most important technique would be to use an existing spider tool which has all of this built in! The following list of resources, culled from my favourites, may be of interest:
Good Luck
inman
Re: Infinite loop prevention for spider
by Corion (Patriarch) on Nov 10, 2003 at 09:25 UTC
I saw a talk by the author of String::Trigram, and he mentioned that he used his module for a similar problem: determining whether a webpage had changed or not. If you tune your similarity threshold well enough, this could be another measure for "page similarity", or rather "these two URLs are the same page".
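Assuming the module's compare() function (I believe it returns a similarity score between 0 and 1, but check the docs), the check might look roughly like this; the 0.9 threshold is just a guess you would have to tune:

use String::Trigram;

# Two fetches of (what may be) the same page.
my $old = "Just another Perl hacker, fetched at 10:15";
my $new = "Just another Perl hacker, fetched at 10:16";

# Assumes String::Trigram::compare() returns a value between 0 and 1.
my $similarity = String::Trigram::compare($old, $new);
print "Probably the same page\n" if $similarity > 0.9;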
perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The
$d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider
($c = $d->accept())->get_request(); $c->send_response( new #in the
HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web