bcrowell2 has asked for the wisdom of the Perl Monks concerning the following question:

O monks,

For my work, every six months I end up needing to do a particular task that requires tediously clicking through my employer's web interface to a database to retrieve 100 different data records. I'm interested in automating the process. Strangely, this web interface doesn't seem to use https anywhere, just plain http. A casual search turned up Web::Scraper, as well as a ton of other CPAN modules. Web::Scraper looks cool, but there's not much documentation. I'm also not clear on which modules would make it convenient to handle both GET and POST, as well as cookies. The canonical example everyone seems to have in mind is eBay auctions, but that only requires going to a particular public URL and retrieving the result, without any need for POST or cookies. For my application, I need to be able to log in with my username and password via POST, and store a cookie.

Any suggestions? Any good examples of code that does this kind of thing?

Thanks!

Ben

Re: good CPAN modules for webscraping with GET, POST and cookies
by moritz (Cardinal) on Aug 31, 2008 at 19:58 UTC
    WWW::Mechanize is the usual recommendation in this case; it handles the cookies very well. I have no idea how good the actual data extraction is, though.
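
    A minimal login-and-fetch sketch (the URL and form field names here are placeholders; inspect your site's login form for the real ones):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use WWW::Mechanize;

        my $mech = WWW::Mechanize->new( autocheck => 1 );

        # Mech keeps an HTTP::Cookies jar internally, so the session
        # cookie set by the login POST rides along on later requests.
        $mech->get('http://intranet.example.com/login');
        $mech->submit_form(
            form_number => 1,
            fields      => { username => 'ben', password => 'secret' },
        );

        # Fetch a record page over the now-authenticated session.
        $mech->get('http://intranet.example.com/record?id=42');
        print $mech->content;
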
Re: good CPAN modules for webscraping with GET, POST and cookies
by Your Mother (Archbishop) on Aug 31, 2008 at 21:28 UTC

    As moritz says, WWW::Mechanize is great. In your case, though, I'd recommend WWW::Selenium, which I learned about here from dragonchild and have only experimented with so far. What I've seen is inspiring. Selenium lets you record a browser session -- the tedious clicking about -- and replay it as a script, with all the hook points and editability you'd expect. It also supports JS, because you're driving a real browser from Perl. You could certainly do what you want with Mech too, but writing the script would probably take long enough to eat up the equivalent of 5 years of "every six months" of annoyance. Selenium might reduce that considerably.
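
    To give you the flavor, here's a bare-bones sketch. It assumes a Selenium RC server is already running on localhost:4444, and the URL and element locators are made up; the recorder writes most of this for you:

        use strict;
        use warnings;
        use WWW::Selenium;

        my $sel = WWW::Selenium->new(
            host        => 'localhost',
            port        => 4444,
            browser     => '*firefox',
            browser_url => 'http://intranet.example.com/',
        );

        $sel->start;
        $sel->open('/login');
        $sel->type('username', 'ben');      # locator = field id/name
        $sel->type('password', 'secret');
        $sel->click('login_button');        # hypothetical button locator
        $sel->wait_for_page_to_load(30000); # timeout in milliseconds
        print $sel->get_body_text;          # grab the page for scraping
        $sel->stop;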

    You probably have an impedance mismatch, though. Scripting the data conversion/scraping is probably a significant effort, and the kind of thing that is likely to produce bad surprises. Getting direct access to the DB and writing plain DBI would likely save you a lot of trouble (bad/missing-data surprises, spending 50 hours reprogramming something to save 5 hours of administrative browsing). Since you're inside the company, I'd try to get permission to go at the data directly.
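
    Something like this is all it takes once you have credentials (the DSN, table, and column names below are made up; your DBA will have the real ones):

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect(
            'dbi:mysql:database=records;host=dbhost',
            'ben', 'secret',
            { RaiseError => 1 },
        );

        # Pull all 100 records in one query instead of 100 clicks.
        my $rows = $dbh->selectall_arrayref(
            'SELECT id, value FROM records WHERE batch = ?',
            undef, '2008-08',
        );
        print join("\t", @$_), "\n" for @$rows;

        $dbh->disconnect;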

Re: good CPAN modules for webscraping with GET, POST and cookies
by patentattorney (Novice) on Aug 31, 2008 at 23:09 UTC
    I've used WWW::Mechanize more than anything, and it works nicely for a lot of browser scripting; LWP::UserAgent does the job too. I've also used Selenium, which can be very useful when you need to figure out how to get at the part of the website you want and it's too painful to pick through the forms and HTML code.
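
    For completeness, here's what a bare LWP::UserAgent version with a cookie jar might look like (URLs and form field names are placeholders):

        use strict;
        use warnings;
        use LWP::UserAgent;
        use HTTP::Cookies;

        my $ua = LWP::UserAgent->new(
            cookie_jar => HTTP::Cookies->new(
                file     => 'cookies.txt',
                autosave => 1,
            ),
        );

        # POST the login form; the session cookie lands in the jar
        # and is sent automatically with every later request.
        my $res = $ua->post(
            'http://intranet.example.com/login',
            { username => 'ben', password => 'secret' },
        );
        die 'Login failed: ' . $res->status_line . "\n"
            unless $res->is_success;

        # A plain GET over the authenticated session.
        $res = $ua->get('http://intranet.example.com/record?id=42');
        print $res->decoded_content if $res->is_success;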

    "Spidering Hacks" from O'Reilly has a lot of good information about various perl tools, and "Perl & LWP", also from O'Reilly also is good. There's also a good amount of information to be had on the net about these tools, although I've found the two books I've mentioned to be the most useful sources for me, at least given my novitiate state.

    Good luck.