
Web Spidering Ajax Sites

by awohld (Hermit)
on Sep 14, 2007 at 02:11 UTC (#638942, perlquestion)

awohld has asked for the wisdom of the Perl Monks concerning the following question:

I'm looking to create a web spider that submits data to, and gets data back from, an Ajax-enabled website. I'd like it to run entirely from my Perl script, using 100% Perl.

Is this possible? If so can you give me some hints on how it's done?

I've scraped some sites that use a lot of JavaScript, and it was basically a matter of decoding the JavaScript and modifying the POST/GET data. This isn't the same approach with Ajax, is it?

I saw Selenium demonstrated at YAPC::NA 2006 and it can do the job. I'd like the solution to be a UNIX-only command-line version using 100% Perl if possible. My server won't have GNOME or similar installed.

Any direction on this would be greatly appreciated.

Replies are listed 'Best First'.
Re: Web Spidering Ajax Sites
by perrin (Chancellor) on Sep 14, 2007 at 02:27 UTC
    Ajax is just JavaScript. If all you want to do is run a certain sequence of requests and capture the results, HTTP::Recorder and WWW::Mechanize will work fine. Just set up the proxy script that comes with HTTP::Recorder, make the requests in your browser, and Recorder will turn that into a Mechanize script that behaves the same as the Ajax code.

      Well, maybe the FAQ is outdated, but Mech's author says it does not play well with JavaScript (as it does not have an engine). Are you saying the situation has changed?

      cheers --stephan

        That information is correct, but totally irrelevant. Mech has no support for JavaScript, but the server doesn't know that. If you wanted to actually execute some JavaScript code, Mech can't do it, but all you want to do is talk to the server as if you were a browser (with JavaScript), and Mech can do that.

        There is nothing that JavaScript can make a browser send to the server that you can't mimic with Mech. The only hard part is figuring out exactly what the JavaScript would send, and using HTTP::Recorder with your browser (or using some other means of looking at the requests, like LiveHTTPHeaders) solves that for you.
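
        A minimal sketch of the approach described above, not taken from the original post. It assumes the recorded Ajax call turned out to be a form-encoded POST; the `replay_ajax_post` helper, the example URL, and the `q`/`page` fields are all hypothetical. The `X-Requested-With` header is one that many JavaScript libraries add to their requests; whether a given server checks it is something to confirm from the recording.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Replay one Ajax call as a plain HTTP POST.
# $ua is any LWP::UserAgent-compatible object (WWW::Mechanize is one).
# Hypothetical helper: the name and argument layout are illustrative only.
sub replay_ajax_post {
    my ($ua, $url, $form) = @_;
    my $res = $ua->post(
        $url,
        $form,                                    # fields the JavaScript would send
        'X-Requested-With' => 'XMLHttpRequest',   # header many Ajax libraries add
    );
    die "Request failed: ", $res->status_line, "\n" unless $res->is_success;
    return $res->decoded_content;
}

# Usage (URL is a placeholder): perl replay.pl http://example.com/ajax/search
if (@ARGV) {
    require WWW::Mechanize;
    # autocheck off so the helper above reports failures itself
    my $mech = WWW::Mechanize->new(autocheck => 0);
    # "q" and "page" are assumed field names; use whatever the recording shows
    print replay_ajax_post($mech, $ARGV[0], { q => 'perl', page => 1 });
}
```

        Passing the user agent in as an argument keeps the same Mechanize object, and therefore the same cookies, available for any ordinary page fetches that have to come before the Ajax call.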

Re: Web Spidering Ajax Sites
by Gangabass (Vicar) on Sep 14, 2007 at 02:54 UTC

    AJAX is just a request to the server made from JavaScript code, so you can do the same from Perl. But first you must work out which requests you need. For that I use the Firefox LiveHTTPHeaders extension.

Re: Web Spidering Ajax Sites
by erroneousBollock (Curate) on Sep 14, 2007 at 05:35 UTC
    Spidering in the classic sense? No, not without having your spidering code magically figure out what the JavaScript might do.

    As others have said, you (the programmer) can figure out what the JavaScript does (or record it with a proxy) and then have your Perl code do that.
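
    One way to picture this, as a hedged sketch rather than anything from the thread: once you know the endpoint the JavaScript calls, "spidering" becomes stepping through that endpoint's parameters instead of following links. The `crawl_endpoint` helper, the `page` query parameter, and the stop-on-empty-response rule are all assumptions made for illustration; a real site may paginate differently.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Walk a known Ajax endpoint page by page instead of following links.
# $fetch_page is a code ref: it takes a page number and returns the
# response body, or undef / an empty string when nothing more is available.
# Hypothetical helper: the stopping rule is an assumption, not a standard.
sub crawl_endpoint {
    my ($fetch_page, $max_pages) = @_;
    my @bodies;
    for my $page (1 .. $max_pages) {
        my $body = $fetch_page->($page);
        last unless defined $body && length $body;   # stop at first empty page
        push @bodies, $body;
    }
    return \@bodies;
}

# Usage (URL is a placeholder): perl crawl.pl http://example.com/ajax/list
if (@ARGV) {
    require WWW::Mechanize;
    my $mech = WWW::Mechanize->new(autocheck => 0);
    my $base = $ARGV[0];
    my $bodies = crawl_endpoint(
        sub {
            my ($page) = @_;
            # "page" is an assumed parameter name; use what the site really sends
            my $res = $mech->get("$base?page=$page");
            return $res->is_success ? $res->decoded_content : undef;
        },
        10,
    );
    printf "Fetched %d page(s)\n", scalar @$bodies;
}
```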


Re: Web Spidering Ajax Sites
by Joost (Canon) on Sep 15, 2007 at 00:13 UTC
    Selenium works by running the html/code through a javascript enabled browser.

    If your question is whether it is possible to emulate a JavaScript-enabled browser in 100% pure Perl, then the answer is yes.

    The catch is that no-one has written anything even close to doing that. Even given working HTTP/WWW and JavaScript libraries it's very far from trivial to cook up a working/scriptable DOM model that can be used from the JavaScript code and is compatible with most current websites, or even a small subset of most websites. And I've tried. :-)

Re: Web Spidering Ajax Sites
by runrig (Abbot) on Sep 15, 2007 at 00:20 UTC
    Selenium would be slower than the equivalent WWW::Mechanize solution, so if you can look at the JavaScript and figure out what requests are actually being sent, the Mechanize route might be worth the effort (I have done it on some web sites). But Selenium would probably be easier to deal with if figuring out the JavaScript is hard (I actually started to use WET and WATIR, but gave up on them once I had figured out the JavaScript).
