Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

I want to create a simple web spider which will grab all the pages inside a directory and check them for title and meta-tags.

I've spent Far Too Long trying to do it myself, and once again been reminded that we have modules like HTML::LinkExtor for a reason.

But now I'm confused about the different Robot and Spider and UA modules.

My task ought to be simple for a module, but can someone please get me started with a little help?

My pseudocode is this:

give the spider a URL, say www.whereiwork.com/site/
recursively {
    find all linked pages, but only within that directory
}
for each page found {
    print out their titles and any meta-tags we find
}
report any errors following the links
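
For what it's worth, the HTML::LinkExtor part on its own seems manageable -- something like the untested sketch below, where the URL is just an example and LWP::Simple is my guess for the fetching:

    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use HTML::LinkExtor;
    use URI;

    my $url  = 'http://www.whereiwork.com/site/';   # example start page
    my $html = get($url) or die "could not fetch $url\n";

    # collect every <a href="..."> on the page, made absolute against $url
    my @links;
    my $extor = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @links, URI->new_abs($attr{href}, $url) if $tag eq 'a' and $attr{href};
    });
    $extor->parse($html);

    print "$_\n" for @links;

It's the recursion, the "stay inside this directory" rule and the title/meta reporting that I'm stuck on.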

Thanks in advance.
--

($_='jjjuuusssttt annootthheer pppeeerrrlll haaaccckkeer')=~y/a-z//s;print;

Replies are listed 'Best First'.
Re: A Spider Tool
by Aristotle (Chancellor) on Aug 18, 2002 at 00:39 UTC
    HTML::LinkExtor does a good amount of the work you want, indeed. To fetch pages you should use LWP::RobotUA. Checking URLs should be easy using URI. For the title and meta tags, HTML::HeadParser is the tool of choice. Finally, your loop will end up looking more like this:
    while (@url) {
        $page = get( $url = shift @url );
        if (not defined $page) {
            push @error, $url;
            next;
        }
        push @url, match_base_url( extract_links($page) );
        push @info, [ extract_header_tags($page) ];
    }
    print_information for @info;
    print_error_url   for @error;
    Update: the get() above is LWP::Simple's.
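
    Filling in those placeholder functions with the modules above, a rough, untested sketch -- reusing the poster's example URL and LWP::Simple's get -- might be:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use LWP::Simple qw(get);
        use URI;
        use HTML::LinkExtor;
        use HTML::HeadParser;

        my $base = URI->new('http://www.whereiwork.com/site/');  # starting directory
        my @url  = ($base);
        my (@info, @error, %seen);

        while (@url) {
            my $url = shift @url;
            next if $seen{ $url->as_string }++;   # don't fetch the same page twice

            my $page = get($url);
            if (not defined $page) {
                push @error, $url;
                next;
            }

            # extract_links + match_base_url
            my $extor = HTML::LinkExtor->new;
            $extor->parse($page);
            for my $link ($extor->links) {
                my ($tag, %attr) = @$link;
                next unless $tag eq 'a' and defined $attr{href};
                my $abs = URI->new_abs($attr{href}, $url);
                $abs->fragment(undef);                           # ignore #anchors
                push @url, $abs if index("$abs", "$base") == 0;  # stay inside the directory
            }

            # extract_header_tags
            my $head = HTML::HeadParser->new;
            $head->parse($page);
            push @info, [ $url, $head->header('Title'), $head->header('X-Meta-Description') ];
        }

        for my $rec (@info) {
            printf "%s\n  title: %s\n  description: %s\n",
                map { defined $_ ? $_ : '(none)' } @$rec;
        }
        print "FAILED: $_\n" for @error;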

    Makeshifts last the longest.

•Re: A Spider Tool
by merlyn (Sage) on Aug 18, 2002 at 09:53 UTC
      Thanks, nearly all the stuff I want is there, like you say. I should read your column more often!
      --
      ($_='jjjuuusssttt annootthheer pppeeerrrlll haaaccckkeer')=~y/a-z//s;print;
Re: A Spider Tool
by neilwatson (Priest) on Aug 18, 2002 at 03:23 UTC
    Please make sure your spider obeys the robots.txt file of any website it visits.
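
    LWP::RobotUA takes care of that automatically -- it fetches and honours robots.txt before making requests. A minimal, untested setup (the agent name and contact address are made up) might be:

        use strict;
        use warnings;
        use LWP::RobotUA;

        # agent name and contact address are placeholders
        my $ua = LWP::RobotUA->new('SiteChecker/0.1', 'webmaster@whereiwork.com');
        $ua->delay(0.1);    # minutes to wait between requests to the same server

        my $response = $ua->get('http://www.whereiwork.com/site/');
        print $response->is_success ? $response->content : $response->status_line;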

    Neil Watson
    watson-wilson.ca

      Thanks for your help.

      Just to note, the server is our own, so respecting robots.txt isn't really an issue. What I need is an automated tool to check our sites for basic compliance with our policy, i.e. "all pages must have meta-tags" and "all pages must have a descriptive title".
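
      So the check itself will probably end up as something like this untested HTML::HeadParser sub (the page's URL and content are passed in by whatever does the crawling):

          use strict;
          use warnings;
          use HTML::HeadParser;

          # $url and $html are the page's address and fetched content
          sub check_head {
              my ($url, $html) = @_;

              my $p = HTML::HeadParser->new;
              $p->parse($html);

              my $title = $p->header('Title');
              my $desc  = $p->header('X-Meta-Description');  # <meta name="description" ...>

              warn "$url: missing or empty <title>\n" unless defined $title and length $title;
              warn "$url: missing meta description\n" unless defined $desc;
          }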
      --

      ($_='jjjuuusssttt annootthheer pppeeerrrlll haaaccckkeer')=~y/a-z//s;print;