Personally, I also wrote App::scrape to hide away my extraction toolkit, which consists of HTML::TreeBuilder::XPath and HTML::TokeParser. But that library only handles convenient extraction from HTML, not navigation and the like.
I like the navigation and extraction API of WWW::Mechanize::Firefox, which is largely a combination of the API of HTML::TreeBuilder::XPath and the API of WWW::Mechanize. Most likely this sympathy comes from my being the author of that module.
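A rough sketch of what that combined API looks like in use (assuming WWW::Mechanize::Firefox is installed from CPAN and a Firefox instance with the MozRepl extension is running; the URL and XPath query here are made up for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize::Firefox;

# Navigation, WWW::Mechanize-style
my $mech = WWW::Mechanize::Firefox->new();
$mech->get('http://example.com/news');

# Extraction, HTML::TreeBuilder::XPath-style: ->xpath returns the
# matching nodes, whose properties can be read hash-style
my @headlines = $mech->xpath('//h2[@class="headline"]');
print $_->{innerHTML}, "\n" for @headlines;
```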
The best take on a simple boilerplate approach I've seen is Querylet, a source filter for describing DBI reports. Maybe you can reformulate your extractions in a language like it. I wrote (but so far have never used in production) a source-filter-less, pluggable version of Querylet at https://github.com/Corion/querylet/tree/pluggable, so if you dislike source filters but like the general language format, you can perhaps reuse that parser instead.
In reply to Re: Web scraping toolkit? by Corion in thread Web scraping toolkit? by mzedeler