http://qs1969.pair.com?node_id=954061

Monk-E has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perl Monks. I'm working on a scraping bot. I've researched it fairly extensively (including reading here, which helped with some initial pointers) but have hit a roadblock: getting the content of the pop-up that appears after a button click. The pop-up uses JavaScript / Ajax. I've included snippets of the target page's HTML/scripts and my code below. Any help is appreciated.

My Goal: I can already scrape the basic (non-pop-up) content from my target web page. The real target is the information that comes up when I click a button (details below).

My Env: Cygwin on Windows Vista; Perl 5.10. I have installed WWW::Scripter, WWW::Mechanize, WWW::Scripter::Plugin::Ajax (and WWW::Scripter::Plugin::JavaScript), LWP*, etc. I know these modules have methods that are supposed to work, but see below for my results trying to use them.

===============
Target site's relevant (and possibly irrelevant... I'm certainly no expert) XHTML:
*Note: I've had to edit some things out since PerlMonks doesn't like to allow links in posts.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
...
<script src="http://www.foo.com/js/jquery-latest.min.js" type="text/javascript"></script>
<l*nk hr*f="http://www.foo.com/css/foobar.min.css?v=44" media="screen" rel="stylesheet" type="text/css" />
<script src="http://www.foo.com/AJAX/foobar_ajax_functions.min.js?v=44" type="text/javascript" charset="utf-8">
</script>
<script type="text/javascript">
  if (document.location.protocol == 'https:') {
    var __AJAX_URL__ = "https://www.foo.com/AJAX";
    var __AJAX_URL_SECURE__ = "https://www.foo.com/AJAX";
    var __JS_ONLY_SITE_BASE__ = "https://www.foo.com";
  } else {
    var __AJAX_URL__ = "http://www.foo.com/AJAX";
    var __AJAX_URL_SECURE__ = "https://www.foo.com/AJAX";
    var __JS_ONLY_SITE_BASE__ = "http://www.foo.com";
  }
  var __BASE_URL__ = "http://www.foo.com";
  var __BASE_URL_SECURE__ = "https://www.foo.com";
  $(document).ready(function(){ $("label").inFieldLabels(); });
</script>
</head>
<body>
...
<div id="entrants">
  ...
  <div id="entrants_nav">
    <a href="javascript:void(0);" onclick="displayAllEntrants(442111,3918381,3);">
      <img src="http://www.foo.com/images/stat_tracking_view_all_button.png" alt="Show All Entrants Button" onmouseover="this.src='http://www.foo.com/images/stat_tracking_view_all_button_over.png';" onmouseout="this.src='http://www.foo.com/images/stat_tracking_view_all_button.png';" />
    </a>
  </div>
</div>
<div id="pays_stat_tracking">
  ...
  <div id="pays_nav">
    <a href="javascript:void(0);" id="pays_viewall" onclick="javascript:displayAllPayPosition(442111,3918381);">
      <img src="http://www.foo.com/images/stat_tracking_view_all_button.png" alt="View All Pays" onmouseover="this.src='http://www.foo.com/images/stat_tracking_view_all_button_over.png';" onmouseout="this.src='http://www.foo.com/images/stat_tracking_view_all_button.png';" />
    </a>
  </div>
</div>
...
<script type="text/javascript">
  $(document).ready(function() {
    var opponent_id = 3907701;
    $("select[name=select-choice]").change(function () {
      $.get('/ajax_GetUserDetails.php?id='+$("select[name=select-choice] option:selected").val()+'&param=YTo5ODM6e2k6MDthOjY6e3M6NjoiaWRVc2VyIjtzOjU6IjExMjYxIjtzmlkIjzOj[...]6==', function(data17) {
        userdata = $.parseJSON(data17);
        draft_id = userdata.draft;
        var today = new Date();
        var dadd = new Date(today.getTime() + 100000);
        document.cookie = "442111opid="+draft_id+'; expires=' + dadd.toGMTString();
        location.reload(true);
      });
    });
    is_started = 0;
  });
</script>
...
</body>
===========
The two onclick actions I'm interested in are "displayAllEntrants()" and "javascript:displayAllPayPosition()" above. The buttons do NOT appear within a <form> element. Methods such as click_button() and click($button) in Perl's WWW::Scripter module (and its parent, WWW::Mechanize) seem to assume the button is part of a form (mine aren't), judging by their CPAN docs. At any rate, I've tried them anyway, without success so far.

When I get the anchor-tag DOM elements using $scripter->document->getElementsByTagName('a'), I'm able to find the two elements I'm interested in (the buttons with the onclick attribute I'm looking for). But when I call HTML::DOM::Element's click() method on such an element, I don't get the content I need. It seems to be OK that the button is not in a form, but the CPAN doc seems to say click is only supported for HTML5. So now what? (My code snippet is below.) Thanks in advance.
============
My code snippet:
use WWW::Scripter;
# ...
$url = "http://www.foo.com/stats/xyz/"; # my target site.
$w = new WWW::Scripter;
$w->use_plugin('Ajax');
$content = $w->get($url);
# @div_elems = $w->document->getElementsByTagName('div');
@anchor_elems = $w->document->getElementsByTagName('a');
foreach $anchor_elem (@anchor_elems) {
    $click_attr = $anchor_elem->getAttribute('onclick');
    if ( ($click_attr =~ m/displayAllPayPosition/) || ($click_attr =~ m/displayAllEntrants/) ) {
        # We're finding this attr and getting here.
        $click_result = $anchor_elem->click(); # Only supported for HTML5?
        print "CLICK RESULT: ", $click_result, "\n"; # Issue: shows nothing.
        # So I also tried:
        $response_info = $anchor_elem->trigger_event('onclick');
        print "RESP INFO: ", $response_info, "\n";
        $resp_content = $response_info->content;
        print "RESP CONTENT: ", $resp_content, "\n";
        # TODO: or could we use $click_resp = $w->click($button_name); ??
        # This would return an HTTP::Response object.. then use $click_resp->content
        # to get at the content. But what is our $button_name? we're not in a form. Dont
        # think we can use it. Other options??
    }
}
exit;
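For debugging, it may help to split each onclick attribute into the handler name and its arguments, so it is clear exactly which JavaScript call the button would make. The following is only a rough standalone sketch in plain Perl: parse_onclick is a hypothetical helper name (not part of WWW::Scripter or HTML::DOM), and the two sample strings are copied from the page source above. It does not fetch anything or run any JavaScript.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration, not a module API):
# split an onclick attribute value such as "foo(1,2);" into the function
# name and its list of arguments. Returns an empty list if it does not match.
sub parse_onclick {
    my ($attr) = @_;
    return unless defined $attr;

    # Optional "javascript:" prefix, then name(args), then an optional ";".
    return unless $attr =~ /^\s*(?:javascript:)?\s*([A-Za-z_]\w*)\s*\(([^)]*)\)\s*;?\s*$/;
    my ($name, $arg_text) = ($1, $2);

    # Trim whitespace around each comma-separated argument.
    my @args;
    for my $arg (split /,/, $arg_text) {
        $arg =~ s/^\s+//;
        $arg =~ s/\s+$//;
        push @args, $arg if length $arg;
    }
    return ($name, @args);
}

for my $attr (
    'displayAllEntrants(442111,3918381,3);',
    'javascript:displayAllPayPosition(442111,3918381);',
) {
    my ($name, @args) = parse_onclick($attr);
    print "$name: @args\n";
}
```

Anchors whose onclick is missing or has some other shape simply yield an empty list, which also avoids matching a regex against an undefined attribute value.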