kdt2006 has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I need to create a spider to crawl some literature sites and return a list of new articles on a given subject. Originally I was hoping to do this using the Google SOAP API, but unfortunately they seem to have stopped handing out keys a while ago. I will now no doubt have to trawl the individual sites. Does anyone have any pointers on this?

Cheers


Replies are listed 'Best First'.
Re: web spider for searching multiple sites
by erroneousBollock (Curate) on Sep 06, 2007 at 01:38 UTC
    Use the obvious modules (LWP or WWW::Mechanize) to fetch the content... then craft a plugin system (one plugin per site) to deal with the content of the fetched sites.

    Make sure the plugin API deals efficiently with the terms you're most concerned about in the pages to be scraped.
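    A minimal sketch of that plugin idea, in pure Perl (no fetching shown): a hash maps each site to a coderef that knows how to pull article titles out of that site's markup, and a dispatcher filters the results by subject. The site names and the HTML patterns here are invented for illustration; in practice each plugin would match the real markup of the site it handles.

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical plugin table: one coderef per site. Each takes raw
    # HTML and returns a list of article titles. The site keys and the
    # markup they match are made-up examples, not real sites.
    my %plugin_for = (
        'example-journal.org' => sub {
            my ($html) = @_;
            return $html =~ m{<h2 class="article">([^<]+)</h2>}g;
        },
        'lit-archive.net' => sub {
            my ($html) = @_;
            return $html =~ m{<a class="title"[^>]*>([^<]+)</a>}g;
        },
    );

    # Dispatch fetched content to the right plugin, then keep only
    # titles mentioning the subject (case-insensitive literal match).
    sub articles_on {
        my ($site, $html, $subject) = @_;
        my $extract = $plugin_for{$site}
            or die "no plugin for $site\n";
        return grep { /\Q$subject\E/i } $extract->($html);
    }
    ```

    You would obtain $html with LWP::UserAgent's get() or WWW::Mechanize before handing it to articles_on(); for markup more complex than these toy patterns, an HTML parser beats regexes.
    
    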

    -David

Re: web spider for searching multiple sites
by mmmmtmmmm (Monk) on Sep 06, 2007 at 10:16 UTC
    Two good books have been written about this -- take a look at Spidering Hacks and Perl & LWP.

    If you want something online, check this out: LWP Tutorial

    ----mmmmtmmmm