Re: Web Spidering Ajax Sites

by perrin (Chancellor)
on Sep 14, 2007 at 02:27 UTC (#638943)

in reply to Web Spidering Ajax Sites

Ajax is just JavaScript. If all you want to do is run a certain sequence of requests and capture the results, HTTP::Recorder and WWW::Mechanize will work fine. Just set up the proxy script that comes with HTTP::Recorder, make the requests in your browser, and Recorder will turn that into a Mechanize script that behaves the same as the Ajax code.

Re^2: Web Spidering Ajax Sites
by sgt (Deacon) on Sep 14, 2007 at 22:10 UTC

    Well, maybe the FAQ is outdated, but Mech's author says it does not play well with JavaScript (as it does not have an engine). Are you saying the situation has changed?

    cheers --stephan

      That information is correct, but totally irrelevant. Mech has no support for JavaScript, but the server doesn't know that. If you wanted to actually execute some JavaScript code, Mech can't do it, but all you want to do is talk to the server as if you were a browser (with JavaScript), and Mech can do that.

      There is nothing that JavaScript can make a browser send to the server that you can't mimic with Mech. The only hard part is figuring out exactly what the JavaScript would send, and using HTTP::Recorder with your browser (or using some other means of looking at the requests, like LiveHTTPHeaders) solves that for you.
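
      As a minimal sketch of what replaying an Ajax call can look like, the block below builds by hand the kind of POST a page's JavaScript might send. The URL, the field names, and the X-Requested-With header are assumptions for illustration, not taken from any real site; substitute whatever HTTP::Recorder or LiveHTTPHeaders shows for yours.

      ```perl
      use strict;
      use warnings;
      use HTTP::Request::Common qw(POST);

      # Build the POST that the page's JavaScript would have sent.
      # Fields are passed as a list so their order is preserved.
      sub build_ajax_request {
          my ($url, @fields) = @_;
          my $req = POST $url, \@fields;
          # Many Ajax libraries add this header, and some servers check for it.
          $req->header('X-Requested-With' => 'XMLHttpRequest');
          return $req;
      }

      my $req = build_ajax_request(
          'http://www.example.com/ajax/search',   # hypothetical endpoint
          q => 'perl', page => 2,                 # hypothetical form fields
      );
      print $req->as_string;

      # Sending it is then ordinary Mechanize (not run here, needs a live site):
      #   use WWW::Mechanize;
      #   my $mech = WWW::Mechanize->new;
      #   $mech->get('http://www.example.com/');  # pick up cookies first
      #   my $res = $mech->request($req);
      #   print $res->decoded_content;
      ```

      The request is built separately from the sending step so it can be printed and compared with the captured browser traffic before anything goes over the wire.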

        Yes, I agree completely with your second paragraph, but the point I was trying to make was that the OP was asking how to deal with JavaScript, and possibly what extra work Ajax requires.

        So if your web scraper wants to deal with content (for some definition of web scraping), what do you do if the server sends back some kind of serialized data that only a true JavaScript engine can decode?

        cheers --stephan
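
        For the common case where the serialized data is plain JSON (an assumption, since the thread does not say what format the server returns), no JavaScript engine is needed to decode it. A minimal sketch using the core JSON::PP module, with a made-up response body standing in for what $mech->content would hold:

        ```perl
        use strict;
        use warnings;
        use JSON::PP qw(decode_json);

        # Hypothetical response body; a real one would come from
        # $mech->content after replaying the Ajax request.
        my $body = '{"total":2,"items":[{"id":1,"title":"first"},{"id":2,"title":"second"}]}';

        # decode_json turns the JSON text into nested Perl hashes and arrays.
        my $data = decode_json($body);
        printf "%d: %s\n", $_->{id}, $_->{title} for @{ $data->{items} };
        ```

        This does not cover responses that are executable JavaScript rather than data; those would still need some other approach.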
