Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:
Currently I do this in what I think of as "the dumb way": I log in via WWW::Mechanize, get() various pages, grab links from one particular chunk of the HTML, append the magic 'print-friendly' request to the query string, then get() and save the result. It all works well enough. If they change their code, I notice because the download fails, I fiddle with the script a bit, and get back on track.
But years of hanging around with the Monks have made me see that it's a little kludgy. It's not a solution I can easily update, it relies on regexes rather than a parser, it's not a solution I could pass to anyone else, and it's not a solution I can generalise (I also like to read nolaS magazine, which would prefer me to pay for this service, and I have another script for that).
So, what's the best way to write a scraper? I'm seeing some kind of general scraping module, which would open a particular file or data structure, as in, YNT.scrape would somehow contain a hash with all the details for the YNT, and nolaS.scrape the details for nolaS and so on.
Essentially the problem looks like:

    with (an arbitrary number of starting pages)
    get (a number of links matching a certain regex or HTML-parsing expression, which might be different for different pages or a default)
    optionally (adding certain flags to query strings or transforming URLs)
    save to (a specified HD location with a specified filename)

and the solution would be one where you would just swap out the YNT.scrape for a new one, or edit it, rather than have to dig into the code.
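One way the with / get / optionally / save to outline might be sketched is as a per-site hash plus one generic routine that never needs editing. This is only an illustration under stated assumptions: the key names (start_pages, link_match, add_query, save_dir), the example.com URLs and the helper subs are all invented here, and HTML::LinkExtor and URI are used in place of regexes for link extraction and query-string handling. The fetcher is passed in as a code ref so the sketch runs offline; in real use it would presumably wrap a logged-in WWW::Mechanize object.

```perl
use strict;
use warnings;
use HTML::LinkExtor;    # ships with the HTML-Parser distribution
use URI;

# Hypothetical contents of a per-site file such as YNT.scrape.
# Every key name and URL here is made up for illustration.
my %site = (
    start_pages => ['http://example.com/front'],
    link_match  => qr{/story/\d+},    # which links to follow
    add_query   => { print => 1 },    # the 'print-friendly' flag
    save_dir    => '/tmp/ynt',
);

# Return absolute URLs of <a href> links in $html that match $pattern.
sub matching_links {
    my ($html, $base, $pattern) = @_;
    my @urls;
    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        return unless $tag eq 'a' && defined $attr{href};
        push @urls, "$attr{href}" if $attr{href} =~ $pattern;
    }, $base);    # passing a base makes the parser absolutize links
    $parser->parse($html);
    $parser->eof;
    return @urls;
}

# Append extra query parameters, keeping any that are already present.
sub transform_url {
    my ($url, $extra) = @_;
    my $uri = URI->new($url);
    $uri->query_form($uri->query_form, %$extra);
    return $uri->as_string;
}

# Derive a local filename from the URL path.
sub filename_for {
    my ($url, $dir) = @_;
    (my $name = URI->new($url)->path) =~ s{[^\w.-]+}{_}g;
    $name =~ s/^_+//;
    return "$dir/$name.html";
}

# The generic part: walk the start pages and build [url, file] pairs.
sub plan {
    my ($site, $fetch) = @_;
    my @jobs;
    for my $page (@{ $site->{start_pages} }) {
        my $html = $fetch->($page);
        for my $url (matching_links($html, $page, $site->{link_match})) {
            my $final = transform_url($url, $site->{add_query} || {});
            push @jobs, [ $final, filename_for($url, $site->{save_dir}) ];
        }
    }
    return @jobs;
}

# Stub fetcher returning canned HTML, so nothing touches the network.
my $fetch = sub { '<a href="/story/42">One</a> <a href="/about">About</a>' };
printf "%s -> %s\n", @$_ for plan(\%site, $fetch);
```

With this split, supporting nolaS would mean writing a second hash rather than a second script. If each hash lived in its own file as the last expression, something like `my $site = do './YNT.scrape';` could load it, and each planned pair could then be handed to WWW::Mechanize's get() with the `':content_file'` option to save the page.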
($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
=~y~b-v~a-z~s; print
Replies are listed 'Best First'.

Re: How To Write A Scraper?
by davidrw (Prior) on Jul 03, 2005 at 23:46 UTC
by Cody Pendant (Prior) on Jul 03, 2005 at 23:52 UTC

Re: How To Write A Scraper?
by jpeg (Chaplain) on Jul 03, 2005 at 23:58 UTC
by Cody Pendant (Prior) on Jul 04, 2005 at 00:09 UTC
by jpeg (Chaplain) on Jul 04, 2005 at 00:25 UTC
by Cody Pendant (Prior) on Jul 04, 2005 at 00:46 UTC
by Cody Pendant (Prior) on Jul 04, 2005 at 01:05 UTC

Re: How To Write A Scraper?
by tphyahoo (Vicar) on Jul 04, 2005 at 10:38 UTC
by Cody Pendant (Prior) on Jul 05, 2005 at 23:18 UTC