Currently I do this in what I think of as "the dumb way", which is: log in via WWW::Mechanize, get() various pages, grab links from one particular chunk of the HTML, append the magic 'print-friendly' request to the query string, then get() and save the result -- and it all works well enough. If they change their code, I notice because the download fails, I fiddle with the script a bit, and get back on track.
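For reference, the current script is something along these lines -- all the URLs, form field names and the 'printable' flag below are stand-ins for illustration, not the real site's details:

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# Every URL, form field and query flag here is a made-up placeholder.
my $mech = WWW::Mechanize->new( autocheck => 1 );

# Log in.
$mech->get('http://example.com/login');
$mech->submit_form(
    form_number => 1,
    fields      => { username => 'me', password => 'secret' },
);

# Grab the article links from one particular page.
$mech->get('http://example.com/currentissue');
my @links = $mech->find_all_links( url_regex => qr/article\?id=\d+/ );

for my $link (@links) {
    # Append the magic 'print-friendly' flag to the query string.
    my $url = $link->url_abs . '&printable=true';
    $mech->get($url);

    # Save under a filename derived from the article id.
    my ($id) = $url =~ /id=(\d+)/;
    $mech->save_content("article_$id.html");
}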
But years of hanging around with the Monks have made me see that it's a little kludgy. It's not a solution I can easily update, it relies on regexes rather than a parser, it's not something I could pass to anyone else, and it's not something I can generalise (I also like to read nolaS magazine, which would prefer me to pay for this service, and I have a separate script for that).
So, what's the best way to write a scraper? I'm imagining some kind of general scraping module which would read a particular file or data structure: YNT.scrape would contain a hash with all the details for the YNT, nolaS.scrape the details for nolaS, and so on.
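As a rough sketch of what one of those files might contain -- every key, URL and pattern below is invented for illustration -- a .scrape file could simply be a Perl hashref that do() can load:

# YNT.scrape -- all the site-specific details in one hashref (values invented)
{
    login_url   => 'http://example.com/login',
    login_form  => { username => 'me', password => 'secret' },
    start_pages => [ 'http://example.com/currentissue' ],
    link_regex  => qr/article\?id=\d+/,
    transform   => sub { $_[0] . '&printable=true' },
    save_dir    => '/home/me/archive/YNT',
    save_name   => sub { ( $_[0] =~ /id=(\d+)/ )[0] . '.html' },
}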
Essentially the problem looks like:
    with (an arbitrary number of starting pages)
    get (a number of links matching a certain regex or HTML-parsing expression, which might be different for different pages or a default)
    optionally (adding certain flags to query strings or transforming URLs)
    save to (a specified HD location with a specified filename)

and the solution would be one where you would just swap out the YNT.scrape for a new one, or edit it, rather than have to dig into the code.
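A minimal sketch of such a driver, assuming the hypothetical .scrape hashref above and its invented keys (start_pages, link_regex, transform, save_dir, save_name), might look like this -- not a finished design, just the shape of it:

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# Usage (hypothetical): perl scrape.pl YNT.scrape
my $conf_file = shift @ARGV or die "usage: $0 <site.scrape>\n";
my $conf      = do "./$conf_file" or die "can't load $conf_file: $@ $!";

my $mech = WWW::Mechanize->new( autocheck => 1 );

# Log in using whatever the config says.
$mech->get( $conf->{login_url} );
$mech->submit_form( form_number => 1, fields => $conf->{login_form} );

# with (starting pages) get (links matching a pattern) ...
for my $page ( @{ $conf->{start_pages} } ) {
    $mech->get($page);
    for my $link ( $mech->find_all_links( url_regex => $conf->{link_regex} ) ) {
        # optionally (transform the URL), then save to (dir/filename)
        my $url = $conf->{transform}->( $link->url_abs . '' );
        $mech->get($url);
        $mech->save_content( "$conf->{save_dir}/" . $conf->{save_name}->($url) );
    }
}

Adding a new magazine would then mean writing a new .scrape file rather than touching the driver.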
($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
=~y~b-v~a-z~s; print