PerlMonks  

Scraping Ajax / JS pop-up

by Monk-E (Initiate)
on Feb 15, 2012 at 20:11 UTC ( [id://954061]=perlquestion )

Monk-E has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perl Monks. I'm working on a scraping bot. I've looked into it pretty extensively (incl. reading here, which helped with some initial pointers) but am at a roadblock: getting content from the pop-up that results from a button click. The pop-up uses JavaScript / Ajax. I've included snippets of the target page's HTML/scripts and my code below. Any help is appreciated. My Goal: I'm able to scrape the basic (non-pop-up) content from my target web page, but the real target is the info that comes up when I click a button (details below).

My Env: Cygwin on Windows Vista; Perl 5.10. I have installed the WWW::Scripter, WWW::Mechanize, WWW::Scripter::Plugin::Ajax (and "..."::JavaScript), LWP*, etc. modules. I know there are methods in these modules that are supposed to work, but see below for my results trying to use them.

===============
Target site's relevant (and possibly irrelevant... I'm certainly no expert) XHTML: *Note: I've had to edit some things out since PerlMonks doesn't like to allow links in posts.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
...
<script src="http://www.foo.com/js/jquery-latest.min.js" type="text/javascript"></script>
<l*nk hr*f="http://www.foo.com/css/foobar.min.css?v=44" media="screen" rel="stylesheet" type="text/css" />
<script src="http://www.foo.com/AJAX/foobar_ajax_functions.min.js?v=44" type="text/javascript" charset="utf-8"> </script>
<script type="text/javascript">
  if (document.location.protocol == 'https:') {
    var __AJAX_URL__ = "https://www.foo.com/AJAX";
    var __AJAX_URL_SECURE__ = "https://www.foo.com/AJAX";
    var __JS_ONLY_SITE_BASE__ = "https://www.foo.com";
  } else {
    var __AJAX_URL__ = "http://www.foo.com/AJAX";
    var __AJAX_URL_SECURE__ = "https://www.foo.com/AJAX";
    var __JS_ONLY_SITE_BASE__ = "http://www.foo.com";
  }
  var __BASE_URL__ = "http://www.foo.com";
  var __BASE_URL_SECURE__ = "https://www.foo.com";
  $(document).ready(function(){ $("label").inFieldLabels(); });
</script>
</head>
<body>
...
<div id="entrants">
  ...
  <div id="entrants_nav">
    <a href="javascript:void(0);" onclick="displayAllEntrants(442111,3918381,3);">
      <img src="http://www.foo.com/images/stat_tracking_view_all_button.png" alt="Show All Entrants Button" onmouseover="this.src='http://www.foo.com/images/stat_tracking_view_all_button_over.png';" onmouseout="this.src='http://www.foo.com/images/stat_tracking_view_all_button.png';" />
    </a>
  </div>
</div>
<div id="pays_stat_tracking">
  ...
  <div id="pays_nav">
    <a href="javascript:void(0);" id="pays_viewall" onclick="javascript:displayAllPayPosition(442111,3918381);">
      <img src="http://www.foo.com/images/stat_tracking_view_all_button.png" alt="View All Pays" onmouseover="this.src='http://www.foo.com/images/stat_tracking_view_all_button_over.png';" onmouseout="this.src='http://www.foo.com/images/stat_tracking_view_all_button.png';" />
    </a>
  </div>
</div>
...
<script type="text/javascript">
  $(document).ready(function() {
    var opponent_id = 3907701;
    $("select[name=select-choice]").change(function () {
      $.get('/ajax_GetUserDetails.php?id='+$("select[name=select-choice] option:selected").val()+'&param=YTo5ODM6e2k6MDthOjY6e3M6NjoiaWRVc2VyIjtzOjU6IjExMjYxIjtzmlkIjzOj[...]6==', function(data17) {
        userdata = $.parseJSON(data17);
        draft_id = userdata.draft;
        var today = new Date();
        var dadd = new Date(today.getTime() + 100000);
        document.cookie = "442111opid="+draft_id+'; expires=' + dadd.toGMTString();
        location.reload(true);
      });
    });
    is_started = 0;
  });
</script>
...
</body>
===========
The two onclick actions I'm interested in are "displayAllEntrants()" and "javascript:displayAllPayPosition()" above. The buttons do NOT appear within a <form> element. Methods such as click_button() and click($button) in Perl's WWW::Scripter module (and its parent, WWW::Mechanize) seem to assume the button is part of a form (mine aren't), as they are described in their CPAN docs. At any rate, I've tried them anyway, without success so far. When I get the anchor tag DOM elements by using $scripter->document->getElementsByTagName('a'), I'm able to find the two elements I'm interested in (the buttons with the onclick attribute I'm looking for). But when I try HTML::DOM::Element's click() method on such an element, I don't get the content I need. It seems to be OK that the button is not in a form, but the CPAN doc seems to say click is only supported for HTML5. So now what? (My code snippet is below.) Thanks in advance.
============
My code snippet:
use WWW::Scripter;
# ...
$url = "http://www.foo.com/stats/xyz/";   # my target site.
$w = new WWW::Scripter;
$w->use_plugin('Ajax');
$content = $w->get($url);
# @div_elems = $w->document->getElementsByTagName('div');
@anchor_elems = $w->document->getElementsByTagName('a');
foreach $anchor_elem (@anchor_elems) {
    $click_attr = $anchor_elem->getAttribute('onclick');
    if ( ($click_attr =~ m/displayAllPayPosition/) || ($click_attr =~ m/displayAllEntrants/) ) {
        # We're finding this attr and getting here.
        $click_result = $anchor_elem->click();   # Only supported for HTML5?
        print "CLICK RESULT: ", $click_result, "\n";   # Issue: shows nothing.
        # So I also tried:
        $response_info = $anchor_elem->trigger_event('onclick');
        print "RESP INFO: ", $response_info, "\n";
        $resp_content = $response_info->content;
        print "RESP CONTENT: ", $resp_content, "\n";
        # TODO: or could we use $click_resp = $w->click($button_name); ??
        # This would return an HTTP::Response object.. then use $click_resp->content
        # to get at the content. But what is our $button_name? We're not in a form. Don't
        # think we can use it. Other options??
    }
}
exit;

Replies are listed 'Best First'.
Re: Scraping Ajax / JS pop-up
by Anonymous Monk on Feb 15, 2012 at 20:16 UTC
        Hi Monk-E, did you solve that problem? I am dealing with a similar problem. If you solved it, please show me the code you ended up with. Thanks for your attention!
      Thanks for the reply.

      1. I read through the info on HTTP::Recorder. It seemed promising (it's cool in its own right... thanks for the link), but unfortunately, it doesn't support JavaScript interactions (as mentioned there and on CPAN).

      2. The problems I mentioned having with mechanizing button clicks on this page using WWW::Mechanize look to be the same for WWW::Mechanize::Firefox, which inherits from it. CPAN shows the methods for button clicks expect the buttons to be within a form. But if I understand your suggestion, it's to use the Firefox plugin (LiveHttpHeaders) to avoid any button clicks / Ajax/JS calls, and just make the equivalent HTTP requests directly from my scrape-bot code. (?)

      I'll def give it a try, but I'm disappointed that it sounds like there isn't any way to do this strictly with Perl modules, without needing 3rd-party software to "cheat" and avoid the JS/Ajax. Seems like something should be out there to do this with Perl, as I'm certainly not the first to want to scrape such a page.

      Thanks again, I'll still try your suggested work-around.

        Monk-E, everything that JavaScript does is client-side, so to fetch new content it has to send a request back to the server at some point, and you can just use Perl to mimic that request. If you want a solution that completely automates JavaScript in any situation, then you need a script that parses and runs JavaScript, which would take an enormous amount of time to write.
        Instead you might, for example, download Firebug for Firefox, right-click and Inspect Element on the particular button you're wondering about, and see what address is being requested from the server. Then use regular expressions to isolate that address and just have your script get() that address and continue.
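
        A minimal sketch of that approach, in case it helps: pull the function name and arguments out of the onclick attribute with a regex, then build the URL to get(). The endpoint layout (base/function?arg0=...) and the arg0/arg1 parameter names are made up for illustration; the real address and parameters are whatever Firebug shows being requested when you click the button. fetch_ajax() assumes LWP::UserAgent is installed and is only defined here, not called.

```perl
use strict;
use warnings;

# Extract the JS function name and its arguments from an onclick attribute,
# e.g. "javascript:displayAllPayPosition(442111,3918381);".
sub parse_onclick {
    my ($onclick) = @_;
    return unless defined $onclick;
    my ($name, $arg_list) = $onclick =~ /(\w+)\s*\(([^)]*)\)/
        or return;
    my @args = split /,/, $arg_list;
    for (@args) { s/^\s+//; s/\s+$//; }    # trim whitespace around each argument
    return ($name, @args);
}

# HYPOTHETICAL URL layout: the real endpoint and parameter names must be
# read from the browser's network log; they are not known from the page HTML.
sub build_ajax_url {
    my ($base, $name, @args) = @_;
    my @pairs;
    for my $i (0 .. $#args) {
        push @pairs, "arg$i=$args[$i]";
    }
    return "$base/$name?" . join('&', @pairs);
}

# Fetch a URL and return the body, or die with the HTTP status line.
# LWP::UserAgent is loaded lazily so the parsing code above runs without it.
sub fetch_ajax {
    my ($url) = @_;
    require LWP::UserAgent;
    my $res = LWP::UserAgent->new->get($url);
    die "Request failed: " . $res->status_line . "\n" unless $res->is_success;
    return $res->decoded_content;
}

my ($name, @args) = parse_onclick('displayAllEntrants(442111,3918381,3);');
print build_ajax_url('http://www.foo.com/AJAX', $name, @args), "\n";
```

        Once the real address is known, passing it to fetch_ajax() (or to $w->get() on the existing WWW::Scripter object, so cookies carry over) should return the same data the pop-up displays.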

        He picked the perl from the dying flesh and held it in his palm, and he turned it over and saw that its curve was perfect in the hand he had smashed against the gate; the torn flesh of the knuckles was turned grayish white by the sea water. ~Steinbeck's "The Perl"

        1. I read through the info on HTTP::Recorder. It seemed promising (it's cool in its own right... thanks for the link), but unfortunately, it doesn't support JavaScript interactions (as mentioned there and on CPAN).

        Sure it does: HTTP is HTTP, and everything Ajax does is HTTP.
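
        The inline script in the question's page is an example: its $.get() is an ordinary GET to /ajax_GetUserDetails.php with id and param query parameters, and the JSON that comes back has a draft field which the page stores in a cookie named 442111opid. A rough sketch of the same steps in Perl follows. It assumes JSON::PP is available (bundled with Perl since 5.14, on CPAN for older perls). The id value 42, the 'PLACEHOLDER==' param and the sample JSON body are invented for illustration; the real param string is truncated in the post and has to be copied from the live page.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Percent-encode everything except RFC 3986 unreserved characters.
# (The page's JS sends the values unescaped; encoding them is the safer default.)
sub escape_param {
    my ($value) = @_;
    $value =~ s/([^A-Za-z0-9\-_.~])/sprintf('%%%02X', ord $1)/ge;
    return $value;
}

# Build the same URL the page's $.get() requests.
sub user_details_url {
    my ($base, $id, $param) = @_;
    return "$base/ajax_GetUserDetails.php?id=" . escape_param($id)
         . '&param=' . escape_param($param);
}

# The page does userdata.draft on the parsed JSON response.
sub draft_from_response {
    my ($json_body) = @_;
    return decode_json($json_body)->{draft};
}

# The page then sets the cookie "442111opid=<draft_id>" before reloading.
sub opid_cookie {
    my ($prefix, $draft_id) = @_;
    return "${prefix}opid=$draft_id";
}

# Invented sample values; a real run would fetch the URL and use its body.
my $url   = user_details_url('http://www.foo.com', 42, 'PLACEHOLDER==');
my $draft = draft_from_response('{"draft":12345}');
print "$url\n";
print opid_cookie(442111, $draft), "\n";
```

        The resulting cookie string can be sent as a Cookie header on the next request for the page, which corresponds to what location.reload(true) does in the browser.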

        I'll def give it a try, but I'm disappointed that it sounds like there isn't any way to do this strictly with Perl modules, without needing 3rd-party software to "cheat" and avoid the JS/Ajax. Seems like something should be out there to do this with Perl, as I'm certainly not the first to want to scrape such a page.

        There is no cheating of any kind going on.

        Do you want to spend 10min with firefox to figure out how to get data out of a website?

        Or do you want to spend 20+ years creating a pure-Perl browser?

        Are you disappointed that perl is written in C instead of Perl? Is using gcc to compile perl cheating?

        WWW::Scripter plugins / WWW::Scripter::Plugin::JavaScript, WWW::Selenium, WWW::HtmlUnit, Gtk2::WebKit::Mechanize/Gtk3::WebKit, Win32::Watir/Win32::IEAutomation/Win32::IE::Mechanize

Re: Scraping Ajax / JS pop-up
by wazoox (Prior) on Feb 16, 2012 at 14:47 UTC
    Alternatively, you may be able to actually execute the JavaScript code from your Perl code, through the JavaScript module. I don't know how DOM interactions and the like would have to be managed, though. You may need to simulate some more "human" activity.
