Re: Web scraping toolkit?

Personally, I also wrote App::scrape to hide away my extraction library consisting of HTML::TreeBuilder::XPath and HTML::TokeParser.

But that library only deals with convenient extraction from HTML, not with the navigation etc.
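
To give an idea of what that kind of extraction looks like, here is a minimal sketch using HTML::TreeBuilder::XPath directly. It is not the actual App::scrape code; the sample HTML, the "item" class and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Made-up sample document; a real scraper would fetch this over HTTP.
my $html = <<'HTML';
<html><body>
  <div class="item"><a href="/a">First</a></div>
  <div class="item"><a href="/b">Second</a></div>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# findvalues() returns the string value of every node matching the XPath query.
my @titles = $tree->findvalues('//div[@class="item"]/a');
my @links  = $tree->findvalues('//div[@class="item"]/a/@href');

print "$titles[$_] -> $links[$_]\n" for 0 .. $#titles;

# Free the parse tree explicitly once we are done with it.
$tree->delete;
```

Fetching the page, following links and filling in forms are all left out here, which is exactly the part such an extraction library does not cover.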

I like the navigation and extraction API of WWW::Mechanize::Firefox, which is mostly a combination of the APIs of HTML::TreeBuilder::XPath and WWW::Mechanize. Most likely, I'm biased here because I'm the author of that module.

The best minimal-boilerplate approach I've seen is Querylet, which is a source filter for describing DBI reports. Maybe you can reformulate your extractions in a similar language. I wrote (but have never used in production so far) a pluggable version of Querylet that works without source filters, at https://github.com/Corion/querylet/tree/pluggable, so if you dislike source filters but like the general language format, you could reuse that parser instead.

Re^2: Web scraping toolkit? by mzedeler (Pilgrim) on Jan 27, 2012 at 08:44 UTC
I think that App::scrape may turn out to be insufficient, since it may not cover some edge cases that need handling. But again, that's my general worry, as I haven't tried any of the scraping modules yet (the same goes for Web::Scraper and Scrappy). WWW::Mechanize::Firefox looks very promising, and implementing the few extra features that Scrapie has (logging and such) shouldn't be a problem. The real drawback lies in having to rely on Firefox (or some similar component) in development and production. I'll go back to the drawing board and see what to do. Thanks for the pointers.
