saikola has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

I want to write a program that crawls web pages. While going through CPAN, I came across the following modules, which look like they can do the job.

->WWW::Mechanize
->WWW::Spyder
->HTML::TokeParser, LWP::Simple

Now I am confused about which module to use, i.e. whether to use the Mechanize module, the Spyder module, or the combination of HTML::TokeParser and LWP::Simple.

Can anybody please tell me which is the best module (efficiency-wise) for crawling web pages?

Please let me know if any other module is better than the ones listed above.

Thanks in advance.

Replies are listed 'Best First'.
Re: Help on Crawling
by InfiniteLoop (Hermit) on Mar 31, 2005 at 06:47 UTC
Re: Help on Crawling
by inman (Curate) on Mar 31, 2005 at 07:40 UTC
    Spidering websites is a difficult and complex task. It is also a problem that has been solved many times before. My suggestion would be to do some research to find and then re-use or extend a previously written spider, e.g. MOMspider.

    Also check out the SearchTools page on Robots and Spiders for inspiration and links. The Indexing Robot Crawlers Checklist will be useful if you decide to write your own code.
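    If you do end up writing your own, LWP::RobotUA (part of the LWP distribution) takes care of robots.txt compliance and request throttling for you. Here's a minimal sketch -- the agent name, contact address, and URL below are just placeholders:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use LWP::RobotUA;

        # LWP::RobotUA fetches and honours robots.txt automatically,
        # and throttles repeated requests to the same host.
        my $ua = LWP::RobotUA->new(
            agent => 'MyCrawler/0.1',      # placeholder agent name
            from  => 'me@example.com',     # placeholder contact address
        );
        $ua->delay( 1 / 60 );    # delay() is in minutes; this is one second

        my $res = $ua->get('http://example.com/');
        print $res->is_success ? $res->decoded_content : $res->status_line;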

Re: Help on Crawling
by Joost (Canon) on Mar 31, 2005 at 21:54 UTC
    I've never used WWW::Spyder, so I can't comment on that one.

    It all depends on what you really want to do with the site(s). If you need to fill in forms and generally interact with a dynamic system, WWW::Mechanize is the best choice. LWP::UserAgent is a bit more low-level. WWW::Mechanize is actually a subclass of LWP::UserAgent, so you can still use all the tricks LWP::UserAgent can do with WWW::Mechanize, but you'll take a (slight) performance hit because WWW::Mechanize parses every page for forms and links even if you don't need that information.
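    To make that concrete, here's a minimal breadth-first crawl with WWW::Mechanize -- a sketch only, with a placeholder start URL and a crude page cap standing in for real politeness and same-host checks:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use WWW::Mechanize;

        my $mech  = WWW::Mechanize->new( autocheck => 0 );
        my @queue = ('http://example.com/');    # placeholder start URL
        my %seen;

        while ( my $url = shift @queue ) {
            next if $seen{$url}++;
            last if keys %seen > 100;           # crude cap for the sketch

            $mech->get($url);
            next unless $mech->success && $mech->is_html;
            print "Fetched: $url\n";

            # Mechanize has already parsed the page, so the links come free.
            push @queue, map { $_->url_abs->as_string } $mech->links;
        }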

    IIRC HTML::TokeParser doesn't do HTTP retrieval, so on its own it's not enough to crawl web pages.
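    So if you go that route, you pair it with LWP::Simple for the retrieval. Something like this (placeholder URL) pulls the links out of a single page:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use LWP::Simple qw(get);
        use HTML::TokeParser;

        my $url  = 'http://example.com/';    # placeholder URL
        my $html = get($url) or die "Couldn't fetch $url\n";

        # LWP::Simple did the retrieval; TokeParser only parses.
        my $p = HTML::TokeParser->new( \$html );
        while ( my $tag = $p->get_tag('a') ) {
            my $href = $tag->[1]{href};
            print "$href\n" if defined $href;
        }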

    I'd probably recommend WWW::Mechanize, unless you have a really specific use for your spider that doesn't fit WWW::Mechanize and you need the performance benefit of going lower-level.