Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

I want to create a simple web spider which will grab all the pages inside a directory and check them for title and meta-tags.

I've spent Far Too Long trying to do it myself, and once again been reminded that we have modules like HTML::LinkExtor for a reason.

But now I'm confused about the different Robot and Spider and UA modules.

My task ought to be simple for a module, but can someone please get me started with a little help?

My pseudocode is this:

give the spider a URL, say www.whereiwork.com/site/
recursively {
    find all linked pages, but only within that directory
}
for each page found {
    print out their titles and any meta-tags we find
}
report any errors following the links
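
For what it's worth, the HTML::LinkExtor part on its own seems manageable -- something like the untested sketch below, where the URL is just an example and LWP::Simple is my guess for the fetching:

    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use HTML::LinkExtor;
    use URI;

    my $url  = 'http://www.whereiwork.com/site/';   # example start page
    my $html = get($url) or die "could not fetch $url\n";

    # collect every <a href="..."> on the page, made absolute against $url
    my @links;
    my $extor = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @links, URI->new_abs($attr{href}, $url) if $tag eq 'a' and $attr{href};
    });
    $extor->parse($html);

    print "$_\n" for @links;

It's the recursion, the "stay inside this directory" rule and the title/meta reporting that I'm stuck on.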

Thanks in advance.
--

($_='jjjuuusssttt annootthheer pppeeerrrlll haaaccckkeer')=~y/a-z//s;print;

Replies are listed 'Best First'.
Re: A Spider Tool
by Aristotle (Chancellor) on Aug 18, 2002 at 00:39 UTC
    HTML::LinkExtor does a good amount of the work you want, indeed. To fetch pages you should use LWP::RobotUA. Checking URLs should be easy using URI. For the title and meta tags, HTML::HeadParser is the tool of choice. Finally, your loop will end up looking more like this:
    while (@url) {
        $page = get( $url = shift @url );
        if (not defined $page) {
            push @error, $url;
            next;
        }
        push @url, match_base_url( extract_links($page) );
        push @info, [ extract_header_tags($page) ];
    }
    print_information for @info;
    print_error_url   for @error;
    Update: the get() above is LWP::Simple's.
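
    Filling in those placeholder functions with the modules above, a rough, untested sketch -- reusing the poster's example URL and LWP::Simple's get -- might be:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use LWP::Simple qw(get);
        use URI;
        use HTML::LinkExtor;
        use HTML::HeadParser;

        my $base = URI->new('http://www.whereiwork.com/site/');  # starting directory
        my @url  = ($base);
        my (@info, @error, %seen);

        while (@url) {
            my $url = shift @url;
            next if $seen{ $url->as_string }++;   # don't fetch the same page twice

            my $page = get($url);
            if (not defined $page) {
                push @error, $url;
                next;
            }

            # extract_links + match_base_url
            my $extor = HTML::LinkExtor->new;
            $extor->parse($page);
            for my $link ($extor->links) {
                my ($tag, %attr) = @$link;
                next unless $tag eq 'a' and defined $attr{href};
                my $abs = URI->new_abs($attr{href}, $url);
                $abs->fragment(undef);                           # ignore #anchors
                push @url, $abs if index("$abs", "$base") == 0;  # stay inside the directory
            }

            # extract_header_tags
            my $head = HTML::HeadParser->new;
            $head->parse($page);
            push @info, [ $url, $head->header('Title'), $head->header('X-Meta-Description') ];
        }

        for my $rec (@info) {
            printf "%s\n  title: %s\n  description: %s\n",
                map { defined $_ ? $_ : '(none)' } @$rec;
        }
        print "FAILED: $_\n" for @error;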

    Makeshifts last the longest.

•Re: A Spider Tool
by merlyn (Sage) on Aug 18, 2002 at 09:53 UTC
      Thanks, nearly all the stuff I want is there, like you say. I should read your column more often!
      --
      ($_='jjjuuusssttt annootthheer pppeeerrrlll haaaccckkeer')=~y/a-z//s;print;
Re: A Spider Tool
by neilwatson (Priest) on Aug 18, 2002 at 03:23 UTC
    Please make sure your spider obeys the robots.txt file of any website it visits.
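
    LWP::RobotUA takes care of that automatically -- it fetches and honours robots.txt before making requests. A minimal, untested setup (the agent name and contact address are made up) might be:

        use strict;
        use warnings;
        use LWP::RobotUA;

        # agent name and contact address are placeholders
        my $ua = LWP::RobotUA->new('SiteChecker/0.1', 'webmaster@whereiwork.com');
        $ua->delay(0.1);    # minutes to wait between requests to the same server

        my $response = $ua->get('http://www.whereiwork.com/site/');
        print $response->is_success ? $response->content : $response->status_line;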

    Neil Watson
    watson-wilson.ca

      Thanks for your help.

      Just to note, the server is our own, so respecting robots.txt isn't really an issue. What I need is an automated tool to check our sites for basic compliance with our policy, i.e. "all pages must have meta-tags" and "all pages must have a descriptive title".
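
      So the check itself will probably end up as something like this untested HTML::HeadParser sub (the page's URL and content are passed in by whatever does the crawling):

          use strict;
          use warnings;
          use HTML::HeadParser;

          # $url and $html are the page's address and fetched content
          sub check_head {
              my ($url, $html) = @_;

              my $p = HTML::HeadParser->new;
              $p->parse($html);

              my $title = $p->header('Title');
              my $desc  = $p->header('X-Meta-Description');  # <meta name="description" ...>

              warn "$url: missing or empty <title>\n" unless defined $title and length $title;
              warn "$url: missing meta description\n" unless defined $desc;
          }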
      --

      ($_='jjjuuusssttt annootthheer pppeeerrrlll haaaccckkeer')=~y/a-z//s;print;