Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

I use Perl to get (a subset of) a well-known newspaper to read on my Palm. Let's call it the Yew Nork Times.

Currently I do this in what I think of as "the dumb way": log in via WWW::Mechanize, get() various pages, grab links from one particular chunk of the HTML, append the magic 'print-friendly' request to the query string, then get() and save the result -- and it all works well enough. If they change their code, I notice because the download fails, I fiddle with the script a bit, and get back on track.
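For illustration, the shape of it is roughly this (a minimal sketch -- the URLs, form fields and filename scheme are invented stand-ins, not the real site's):

  #!/usr/bin/perl
  use strict;
  use warnings;
  use WWW::Mechanize;

  my $mech = WWW::Mechanize->new( autocheck => 1 );

  # log in (form and field names are made up for illustration)
  $mech->get('http://www.example-paper.com/login');
  $mech->submit_form(
      form_number => 1,
      fields      => { USERID => 'me', PASSWORD => 'secret' },
  );

  # fetch a section page and grab story links out of it
  $mech->get('http://www.example-paper.com/pages/books/');
  my @links = $mech->find_all_links( url_regex => qr{/2005/07/} );

  for my $link (@links) {
      # append the magic 'print-friendly' flag
      # (naive: assumes the URL already has a query string)
      my $url = $link->url_abs . '&pagewanted=print';
      $mech->get($url);

      # save under a filename derived from the URL
      my $file = $link->url_abs->as_string;
      $file =~ s/\W+/_/g;
      open my $fh, '>', "/home/me/palm/$file.html" or die $!;
      print {$fh} $mech->content;
      close $fh;
  }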

But years of hanging around with the Monks have made me see that it's a little kludgy. It's not a solution I can easily update, it relies on regexes rather than a parser, it's not a solution I could pass to anyone else, and it's not one I can generalise (I also like to read nolaS magazine, which would prefer that I pay for this service, and I have another script for that).

So, what's the best way to write a scraper? I'm imagining some kind of general scraping module which would read a site-specific file or data structure: YNT.scrape would somehow contain a hash with all the details for the YNT, nolaS.scrape the details for nolaS, and so on.

Essentially the problem looks like:

with (an arbitrary number of starting pages)
get (links matching a certain regex or HTML-parsing expression, which might differ per page or fall back to a default)
optionally (add certain flags to query strings, or transform URLs)
save to (a specified location on disk, with a specified filename)
and the solution would be one where you could just swap out the YNT.scrape for a new one, or edit it, rather than having to dig into the code.
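Put concretely, something like this, perhaps (a minimal sketch -- the profile keys and their values are invented to show the shape, not a working YNT profile):

  use strict;
  use warnings;
  use WWW::Mechanize;

  # everything site-specific lives in one structure -- this is
  # roughly what a "YNT.scrape" file would boil down to
  my %profile = (
      start_pages   => [ 'http://example.com/books/', 'http://example.com/arts/' ],
      link_regex    => qr{/2005/\d\d/\d\d/},               # which links to follow
      transform_url => sub { $_[0] . '&pagewanted=print' }, # optional URL munging
      save_dir      => '/home/me/palm',
  );

  my $mech = WWW::Mechanize->new( autocheck => 1 );

  for my $page ( @{ $profile{start_pages} } ) {
      $mech->get($page);
      for my $link ( $mech->find_all_links( url_regex => $profile{link_regex} ) ) {
          my $url = $profile{transform_url}->( $link->url_abs );
          $mech->get($url);

          my $file = $link->url_abs->as_string;
          $file =~ s/\W+/_/g;
          open my $fh, '>', "$profile{save_dir}/$file.html" or die $!;
          print {$fh} $mech->content;
          close $fh;
      }
  }

Swapping publications would then just mean swapping the %profile hash (or the file it's loaded from); the loop itself never changes.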


($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
=~y~b-v~a-z~s; print

Re: How To Write A Scraper?
by davidrw (Prior) on Jul 03, 2005 at 23:46 UTC
      RSS: The NYT is very parsimonious about what it gives out on those feeds -- five or six stories, and an annoying amount of crossover (between Arts and Books, for instance, you might get 12 links to the same 9 stories).

      Scrapers: that framework seems designed for search engines only, although I suppose it might address many of the same ideas.

      Good points about find_all_links and the O'Reilly thing, thanks.



      ($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
      =~y~b-v~a-z~s; print
Re: How To Write A Scraper?
by jpeg (Chaplain) on Jul 03, 2005 at 23:58 UTC
    I don't think there *can* be a general scraping module. There probably aren't as many ways to lay out an HTML page as there are sites out there, but it's close. With our current tools, you simply need to know the HTML of a page before you can parse it. That's the only guaranteed way of getting what you want. That said, there are workarounds: see the last time this question came up.

    --
    jpg
      I don't disagree that you need to know the HTML before you start. My idea was more about creating a scraper object, rather than procedural code, in which the part that "knows the HTML" is a neat sub-section using a regex or instructions for a parser.

      So rather than have script A which says

      # ... having got to a certain page
      $mech->content() =~ m/LONGFIDDLYCAPTURINGREGEXHERE/;
      my $place_where_the_links_are = $1;
      # get the links from that place and continue
      and another script B which does the same, and another script C with yet another, and so on, at least I could push the complexity down into a module or object and not have to see it, and not have to write multiple scripts for multiple publications.

      And the moment the NYT changes their HTML, someone could figure out the new regex and update Scraper::Newspapers::NYT or whatever it would be.
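      Something like this, maybe (a minimal sketch -- the package and method names are invented; the point is that only one sub "knows the HTML"):

      package Scraper::Newspapers::NYT;    # hypothetical name
      use strict;
      use warnings;
      use base 'Scraper::Newspapers';      # generic base class, also hypothetical

      # the one sub that "knows the HTML" -- when the site changes,
      # this is the only thing anyone needs to update
      sub link_area {
          my ( $self, $html ) = @_;
          $html =~ m/LONGFIDDLYCAPTURINGREGEXHERE/ or return;
          return $1;
      }

      1;

      The base class would hold the generic fetch-and-save loop and call $self->link_area( $mech->content ) without caring which publication it's talking to.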



      ($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
      =~y~b-v~a-z~s; print
        Oh, I see....I think.
        At the moment, it sounds like you're describing a collection of site-specific parsers, like the Finance::Quote tree, perhaps?
        --
        jpg
Re: How To Write A Scraper?
by tphyahoo (Vicar) on Jul 04, 2005 at 10:38 UTC
    I've been doing a lot of scraping, and what works for me is:

    -- HTML::TreeBuilder

    -- Inheritance

    -- Every scrape operation in its own module, with each module tested separately against canned test data, to make sure the right thing gets extracted. This uses Test::Harness.
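    A rough sketch of what that might look like (the package, class and file names are invented for illustration):

    package Scraper::NolaS;    # hypothetical per-site module
    use strict;
    use warnings;
    use base 'Scraper::Base';  # shared fetch/save code lives in the base class
    use HTML::TreeBuilder;

    # the site-specific part: pull story links out of a page
    # with a parser instead of a regex
    sub story_links {
        my ( $self, $html ) = @_;
        my $tree = HTML::TreeBuilder->new_from_content($html);
        my @urls = map { $_->attr('href') }
                   $tree->look_down( _tag => 'a', class => 'headline' );
        $tree->delete;
        return @urls;
    }

    1;

    and the matching test, run against canned HTML so it never touches the network:

    # t/nolas.t -- picked up by Test::Harness via 'make test' or prove
    use strict;
    use warnings;
    use Test::More tests => 1;
    use Scraper::NolaS;

    open my $fh, '<', 't/data/frontpage.html' or die $!;
    my $html = do { local $/; <$fh> };

    my @links = Scraper::NolaS->story_links($html);
    ok( @links, 'extracted story links from the canned page' );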

      Have you considered publicising what you've done, or making it available for use or comment?


      ($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
      =~y~b-v~a-z~s; print