Dear fellow monks,
I have been looking into the web scraping frameworks available in Perl, and gathered the following list bit by bit from various PerlMonks discussions and blog posts. I'm listing them here for posterity, but my main purpose is to get the community's feedback on the current status of these frameworks and on the best way to go about web spidering in modern Perl.
(Note that the comments are quick first impressions and may be wildly inaccurate; corrections welcome.)
Starting from:
- Good old WWW::Mechanize and HTML::TreeBuilder (Mojo::UserAgent and Mojo::DOM seem to be basically equivalent, though I haven't tried them).
- Comments: Gets the job done and gives you full control (edit the HTML before parsing if you want, get HTML dumps easily, etc.), but the code ends up quite verbose and boilerplate-heavy (see the first sketch after this list).
- Scrappy
- Comments: Looks interesting, but the docs are a bit scattered and confusing, and development seems to have stagnated.
- Gungho
- Comments: Looks perfect, with async IO, automatic robots.txt handling, and actual built-in logging, but unfortunately development seems to have stopped here too.
- YADA
- Comments: Just came across it; haven't used it yet.
- Web::Scraper
- Comments: This is what I'm using now. The DSL syntax is nice though a bit under-documented, and I had to peek into the sources quite a bit to understand or customize many things (see the second sketch below).
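
To make the boilerplate complaint concrete, here is a minimal sketch of the WWW::Mechanize + HTML::TreeBuilder combo; the URL is a placeholder and the link-dumping task is just an example:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::Mechanize;
    use HTML::TreeBuilder;

    # Placeholder URL, purely for illustration.
    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get('http://example.com/');

    # Parse the fetched HTML into a queryable tree.
    my $tree = HTML::TreeBuilder->new_from_content( $mech->content );

    # Dump the text and target of every link on the page.
    for my $a ( $tree->look_down( _tag => 'a' ) ) {
        my $href = $a->attr('href') or next;
        printf "%s => %s\n", $a->as_trimmed_text, $href;
    }

    $tree->delete;    # TreeBuilder trees aren't garbage collected; free them explicitly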
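
And for contrast, roughly the same link dump in Web::Scraper's DSL, again a minimal sketch against a placeholder URL:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use URI;
    use Web::Scraper;

    # The DSL: CSS selectors on the left, result keys and value
    # sources ('TEXT' for content, '@attr' for attributes) on the right.
    my $s = scraper {
        process 'title', title     => 'TEXT';
        process 'a',     'links[]' => '@href';
    };

    # scrape() accepts a URI and fetches the page itself.
    my $res = $s->scrape( URI->new('http://example.com/') );

    print "Title: $res->{title}\n";
    print "$_\n" for @{ $res->{links} || [] };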