Re^2: Scraping Ajax / JS pop-up

by Monk-E (Initiate)
on Feb 15, 2012 at 21:54 UTC ( [id://954090] )


in reply to Re: Scraping Ajax / JS pop-up
in thread Scraping Ajax / JS pop-up

Thanks for the reply.

1. I read through the info on HTTP::Recorder. It seemed promising (it's cool in its own right... thanks for the link), but unfortunately it doesn't support JavaScript interactions (as mentioned there and on CPAN).

2. The problems I mentioned having with mechanizing button clicks on this page using WWW::Mechanize look to be the same for WWW::Mechanize::Firefox, which mirrors its interface. The CPAN docs show that the button-click methods expect the buttons to live inside a form. But if I understand your suggestion, it's to use the Firefox plugin (LiveHttpHeaders) to avoid the button clicks / AJAX/JS calls altogether, and just make the equivalent HTTP requests directly from my scrape-bot code. (?)
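
If that's the idea, here is a minimal sketch of what I picture doing once LiveHttpHeaders shows me the request behind the click (the endpoint and parameters below are placeholders I would fill in from the captured headers):

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new( agent => 'Mozilla/5.0' );

    # Replay the request that LiveHttpHeaders showed for the button click.
    # The URL and parameters are placeholders taken from the captured
    # headers, not from any visible form on the page.
    my $resp = $ua->post(
        'http://example.com/ajax/popup',     # placeholder endpoint
        {
            item_id => 12345,                # placeholder parameters
            action  => 'show_details',
        },
    );

    die 'Request failed: ', $resp->status_line unless $resp->is_success;
    print $resp->decoded_content;            # the data the pop-up would show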

I'll definitely give it a try, but I'm disappointed that it sounds like there isn't any way to do this strictly with Perl modules, without needing third-party software to "cheat" around the JS/AJAX. It seems like something should be out there to do this in Perl, as I'm certainly not the first to want to scrape such a page.

Thanks again, I'll still try your suggested work-around.

Replies are listed 'Best First'.
Re^3: Scraping Ajax / JS pop-up
by kino (Initiate) on Feb 16, 2012 at 02:01 UTC

    Monk-E, everything that JavaScript does happens client-side, so to get new data it has to send a request back to the server at some point, and you can just use Perl to mimic that request. If you want a solution that completely automates JavaScript in every situation, you would need a script that actually parses and runs JavaScript, which would take an enormous amount of time to write.
    Instead you might, for example, install Firebug for Firefox, right-click and Inspect the Element of the particular button you're wondering about, see what address is being requested from the server, use regular expressions to isolate that address, and then just have your script get() that address and continue.
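
    A rough sketch of that approach (the page address and the regex are placeholders; you'd tailor the pattern to whatever Firebug shows in the element's onclick handler):

        use strict;
        use warnings;
        use WWW::Mechanize;

        my $mech = WWW::Mechanize->new();
        $mech->get('http://example.com/listing');   # placeholder page

        # Pull out the address the JavaScript handler would have requested.
        # The pattern below is only an example -- adapt it to the page.
        my ($ajax_url) = $mech->content =~ m{loadPopup\('([^']+)'\)}
            or die "Couldn't find the pop-up URL in the page";

        $mech->get($ajax_url);    # fetch it directly, no JavaScript needed
        print $mech->content;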

    He picked the perl from the dying flesh and held it in his palm, and he turned it over and saw that its curve was perfect in the hand he had smashed against the gate; the torn flesh of the knuckles was turned grayish white by the sea water. ~Steinbeck's "The Perl"
      Thanks. There is a theme here. I'll definitely look in this direction. I guess I had hoped that the JavaScript-handling module plug-ins would be able to handle this (and perhaps I was overlooking a powerful module or function, or misusing one). Thanks for the pointer in the right direction.
Re^3: Scraping Ajax / JS pop-up
by Anonymous Monk on Feb 15, 2012 at 22:57 UTC

    1. I read through the info on HTTP::Recorder. It seemed promising (it's cool in its own right... thanks for the link), but unfortunately it doesn't support JavaScript interactions (as mentioned there and on CPAN).

    Sure it does; HTTP is HTTP, and everything AJAX does is HTTP.
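
    The usual setup from the HTTP::Recorder docs looks roughly like this: point the browser's proxy setting at it (port 8080 by default) and the AJAX requests are logged right along with the normal page loads (the log path is just an example):

        use strict;
        use warnings;
        use HTTP::Proxy;
        use HTTP::Recorder;

        my $proxy = HTTP::Proxy->new();   # listens on localhost:8080 by default

        # Use HTTP::Recorder as the proxy's agent so every request --
        # including the XMLHttpRequest traffic -- gets written to the log.
        my $agent = HTTP::Recorder->new();
        $agent->file('/tmp/http-recorder.log');   # example log location
        $proxy->agent($agent);

        $proxy->start();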

    I'll definitely give it a try, but I'm disappointed that it sounds like there isn't any way to do this strictly with Perl modules, without needing third-party software to "cheat" around the JS/AJAX. It seems like something should be out there to do this in Perl, as I'm certainly not the first to want to scrape such a page.

    There is no cheating of any kind going on.

    Do you want to spend 10 minutes with Firefox to figure out how to get data out of a website?

    Or do you want to spend 20+ years creating a pure-Perl browser?

    Are you disappointed that perl is written in C instead of Perl? Is using gcc to compile perl cheating?

    Scripter::Plugin / Scripter::Plugin::JavaScript, WWW::Selenium, WWW::HtmlUnit, Gtk2::WebKit::Mechanize/Gtk3::WebKit, Win32::Watir/Win32::IEAutomation/Win32::IE::Mechanize
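
    For instance, with WWW::Selenium (needs a running Selenium server; the URL and locator below are only examples) the click is driven by a real browser, so the JavaScript simply runs:

        use strict;
        use warnings;
        use WWW::Selenium;

        my $sel = WWW::Selenium->new(
            host        => 'localhost',
            port        => 4444,
            browser     => '*firefox',
            browser_url => 'http://example.com/',   # example site
        );

        $sel->start;
        $sel->open('/listing');                # example page
        $sel->click('id=detailsButton');       # example locator for the "button"
        sleep 3;   # crude wait for the AJAX response; a real script might
                   # poll $sel->is_element_present(...) instead
        print $sel->get_html_source;           # scrape the updated page
        $sel->stop;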

      I'm not sure you understand what I'm getting at. I have no intention of creating a pure-Perl browser; the whole intention of these modules, and of bot programmers, is to automate the scraping. Your analogy doesn't apply here, as we're talking about the higher functionality layer, not the underlying code used to build it. The modules advertise, or at least imply, the ability to automate this type of interaction, so having to step outside of them suggests that either 1. the module is truly more limited than advertised, or 2. (most likely) I'm not understanding how to use it. Your answer, if I understand it, suggests I would need to abandon a pure-Perl, fully automated solution. If so, so be it, but I want to make sure current Perl modules really can't do this. I'm inclined to guess they can, because they claim JS/AJAX support and are known to handle button clicks within forms.

      1. Yes, HTTP occurs, and yes, HTTP::Recorder deals with HTTP. But the heart of the issue is mechanization: Recorder does not mechanize the JavaScript button clicks, as mentioned in the CPAN docs.

      2. The intent is to do this programmatically. So when I refer to needing a third-party browser tool to see and then mimic the HTTP behind the button actions as "cheating", what I mean is that the very intention of these Perl modules is to automate and handle browser interactions, including JS clicks, robustly.

      3. Now, I could be misunderstanding you completely. I am familiar with WWW::Mechanize (and some similar modules), but not with WWW::Mechanize::Firefox, which perhaps has some ability to use the LiveHttpHeaders plugin to do its own handling of button clicks. The way I read your suggestion is to use the Firefox plugin to comb through the logged HTTP interactions myself, and then use Mechanize, etc., to mimic the button clicks by just plugging in the HTTP I sniffed. My apologies if I'm not understanding correctly, and thanks again for the suggestions so far. I'm sure you're more experienced at this than I am, so please bear with me if I'm misunderstanding.

        Monk-E, you have a good question here, but the language you use makes it sound as if these modules are making false claims, as though they have somehow lied to you.

        Also, I think what you are describing as "button clicks" may not be buttons at all; they sound more like standard <a href> links that JavaScript intercepts. So you may be looking for the wrong thing in the docs.
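
        If that's the case, plain WWW::Mechanize may already be enough; something along these lines (the link text is a placeholder):

            use strict;
            use warnings;
            use WWW::Mechanize;

            my $mech = WWW::Mechanize->new();
            $mech->get('http://example.com/listing');    # placeholder page

            # The "button" is really an <a href="..."> that JavaScript
            # intercepts, so just follow the underlying link directly.
            my $link = $mech->find_link( text_regex => qr/details/i )
                or die 'No matching link found';

            $mech->get( $link->url_abs );
            print $mech->content;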
