Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I built (borrowing code from here and other sites) a simple robot that traverses our site(s) looking for keywords. I would like to limit the depth as it traverses the tree to two or three levels down. I've searched the docs for WWW::Robot, LWP::RobotUA, and a few other pages for a way to control the depth. Do any of the monks have wisdom here? Thanks. Here is the code, in case it is useful:
    #!/usr/bin/perl -w
    use strict;
    use WWW::Robot;
    use LWP::UserAgent;
    use CGI::Pretty qw(-no_debug :html);
    use HTML::Entities;
    $|++;

    my $keyword = "Perlmonger";
    my %pages_w_keyword;
    my $contents;
    my @URL = qw(http://mysite1111.com/);

    sub OK_TO_FOLLOW {
        my $uri = shift;    # URI object, known to be http only
        for ($uri->host) {
            return 0 unless /mysite1111/i;
        }
        for ($uri->query) {
            return 0 if defined $_ and length;
        }
        for ($uri->path) {
            return 0 if /^\/(cgi|fors|-)/;
            return 0 unless /(\.html?|\/)$/;
        }
        return 1;
    }

    my $robot = WWW::Robot->new(
        NAME             => 'CleanOurSite',
        VERSION          => '1.0',
        EMAIL            => 'me@myaddress.com',
        USERAGENT        => LWP::UserAgent->new,
        CHECK_MIME_TYPES => 0,
        ## VERBOSE       => 1,
        IGNORE_TEXT      => 0,
    );
    $robot->env_proxy;

    $robot->addHook("follow-url-test" => sub {
        my ($robot, $hook, $url) = @_;
        return 0 unless $url->scheme eq 'http';
        OK_TO_FOLLOW($url);
    });

    $robot->addHook("invoke-on-contents" => sub {
        my ($robot, $hook, $url, $response, $structure) = @_;
        $contents = $response->content;
        print "URL = $url\n";    # Debug printing
        if ($contents =~ /$keyword/) {
            $pages_w_keyword{$url} = $keyword;
        }
    });

    $robot->run(@URL);

    for my $k (keys %pages_w_keyword) {
        print " $k $pages_w_keyword{$k}\n";
    }

Replies are listed 'Best First'.
Re: Need to limit robot depth using WWW::Robot
by simonm (Vicar) on Jul 31, 2003 at 19:21 UTC

    I would like to control the depth as it traverses the tree to two or three levels down.

    By "two or three levels down," do you mean "links away from the home page", or "levels deep within the site hierarchy," or something else?

    Checking the directory depth in your OK_TO_FOLLOW would be easy enough -- just scan $uri->path for the number of slashes: return 0 if ( tr[/][/] > 3 );
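    That slash-counting check could be sketched as a small standalone helper. The name path_within_depth and the limit of 3 are illustrative, not part of WWW::Robot; in the original OK_TO_FOLLOW you would call it with $uri->path before the other path tests:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Illustrative helper: treat the number of slashes in the URL path
    # as the hierarchy depth and refuse anything deeper than $max_depth.
    # tr[/][/] returns the count of slashes without changing the string.
    sub path_within_depth {
        my ($path, $max_depth) = @_;
        my $slashes = ($path =~ tr[/][/]);
        return $slashes <= $max_depth;
    }

    # Inside OK_TO_FOLLOW you would add something like:
    #   return 0 unless path_within_depth($uri->path, 3);

    print path_within_depth('/a/page.html', 2)     ? "follow" : "skip", "\n";
    print path_within_depth('/a/b/c/page.html', 2) ? "follow" : "skip", "\n";

    Note that this measures directory depth within the site, not the number of links followed from the start page, so it matches the "levels deep within the site hierarchy" reading of the question.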

      Ah. Good question. I did mean levels deep within the site hierarchy. And I like the idea of counting slashes. I'll give it a go. Interesting that this is not obviously (to me) intrinsic to WWW::Robot. I expected that acceptable robot behavior suggests a limit to within-site depth, and that this would somehow be part of the package. Either way, I have a solution. Thanks. -Michael