I saw a talk by the author of String::Trigram, and he mentioned that he used his module for a similar problem: determining whether a webpage had changed or not. If you tune your similarity threshold well enough, this could be another measure of "page similarity", that is, of "these two URLs are the same page".
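To make the idea concrete, here is a rough sketch of a trigram comparison in plain Perl. It does not use String::Trigram itself and does not reproduce that module's scoring; the sub names (`trigrams`, `similarity`, `same_page`) and the 0.9 threshold are my own assumptions for illustration, so treat it as a starting point rather than the way the module's author did it.

```perl
use strict;
use warnings;

# Count the overlapping three-character chunks of a string.
sub trigrams {
    my ($text) = @_;
    $text = lc $text;
    $text =~ s/\s+/ /g;    # collapse whitespace so layout changes do not count
    my %count;
    $count{ substr $text, $_, 3 }++ for 0 .. length($text) - 3;
    return \%count;
}

# Shared trigrams divided by all trigrams:
# 0 means nothing in common, 1 means identical trigram sets.
sub similarity {
    my $x = trigrams( $_[0] );
    my $y = trigrams( $_[1] );
    my %seen = ( %$x, %$y );
    my ( $shared, $total ) = ( 0, 0 );
    for my $t ( keys %seen ) {
        my $cx = $x->{$t} || 0;
        my $cy = $y->{$t} || 0;
        $shared += $cx < $cy ? $cx : $cy;
        $total  += $cx > $cy ? $cx : $cy;
    }
    return $total ? $shared / $total : 1;
}

# Hypothetical threshold; tune it against pages you know to be the same.
my $THRESHOLD = 0.9;

# True if the two page bodies are similar enough to count as one page.
sub same_page {
    my ( $x, $y ) = @_;
    return similarity( $x, $y ) >= $THRESHOLD;
}
```

A spider could call `same_page($old_body, $new_body)` on the content fetched from two URLs and skip the second one when it returns true. Where exactly to put the threshold depends on how much of each page is boilerplate such as navigation or timestamps, so it is worth checking against a few known duplicates first.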
perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider ($c = $d->accept())->get_request(); $c->send_response( new #in the HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web
In reply to Re: Infinite loop prevention for spider
by Corion
in thread Infinite loop prevention for spider
by Wassercrats