Hi Perl Monks. I'm working on a scraping bot. I've looked into it pretty extensively (incl. reading here, which helped with some initial pointers) but am at a roadblock, namely with getting content from the pop-up that results from a button click. The pop-up uses JavaScript / Ajax. I've included snippets of the target page HTML/scripts and my code, below. Any help is appreciated.
My Goal: I'm able to scrape the content from my target web page at a basic level (non-popup content). But the real target is to scrape the info that comes up when I click a button (its details below).
My Env: Cygwin on Windows Vista; Perl 5.10. I have installed the WWW::Scripter, WWW::Mechanize, WWW::Scripter::Plugin::Ajax (and WWW::Scripter::Plugin::JavaScript), LWP*, etc. modules. I know there are methods in these modules that are supposed to work, but see below for my results trying to use them.
===============
Target site's relevant (and possibly irrelevant... I'm certainly no expert) XHTML:
*Note: I've had to edit some things out since PerlMonks doesn't like to allow links in posts.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
...
<script src="http://www.foo.com/js/jquery-latest.min.js" type="text/javascript"></script>
<l*nk hr*f="http://www.foo.com/css/foobar.min.css?v=44" media="screen" rel="stylesheet" type="text/css" />
<script src="http://www.foo.com/AJAX/foobar_ajax_functions.min.js?v=44" type="text/javascript" charset="utf-8">
</script>
<script type="text/javascript">
if (document.location.protocol == 'https:') {
var __AJAX_URL__ = "https://www.foo.com/AJAX";
var __AJAX_URL_SECURE__ = "https://www.foo.com/AJAX";
var __JS_ONLY_SITE_BASE__ = "https://www.foo.com";
} else {
var __AJAX_URL__ = "http://www.foo.com/AJAX";
var __AJAX_URL_SECURE__ = "https://www.foo.com/AJAX";
var __JS_ONLY_SITE_BASE__ = "http://www.foo.com";
}
var __BASE_URL__ = "http://www.foo.com";
var __BASE_URL_SECURE__ = "https://www.foo.com";
$(document).ready(function(){ $("label").inFieldLabels(); });
</script>
</head>
<body>
...
<div id="entrants">
...
<div id="entrants_nav">
<a href="javascript:void(0);" onclick="displayAllEntrants(442111,3918381,3);">
<img src="http://www.foo.com/images/stat_tracking_view_all_button.png"
alt="Show All Entrants Button" onmouseover="this.src='http://www.foo.com/images/stat_tracking_view_all_button_over.png';" onmouseout="this.src='http://www.foo.com/images/stat_tracking_view_all_button.png';" /> </a>
</div>
</div>
<div id="pays_stat_tracking">
...
<div id="pays_nav">
<a href="javascript:void(0);" id="pays_viewall"
onclick="javascript:displayAllPayPosition(442111,3918381);">
<img src="http://www.foo.com/images/stat_tracking_view_all_button.png" alt="View All Pays" onmouseover="this.src='http://www.foo.com/images/stat_tracking_view_all_button_over.png';" onmouseout="this.src='http://www.foo.com/images/stat_tracking_view_all_button.png';" /> </a>
</div>
</div>
...
<script type="text/javascript">
$(document).ready(function() {
var opponent_id = 3907701;
$("select[name=select-choice]").change(function () {
$.get('/ajax_GetUserDetails.php?id='+$("select[name=select-choice] option:selected").val()+'&param=YTo5ODM6e2k6MDthOjY6e3M6NjoiaWRVc2VyIjtzOjU6IjExMjYxIjtzmlkIjzOj[...]6==', function(data17) {
userdata = $.parseJSON(data17);
draft_id = userdata.draft;
var today = new Date();
var dadd = new Date(today.getTime() + 100000);
document.cookie = "442111opid="+draft_id+'; expires=' + dadd.toGMTString();
location.reload(true);
});
});
is_started = 0;
});
</script>
...
</body>
===========
The two onclick actions I'm interested in are "displayAllEntrants()" and "javascript:displayAllPayPosition()" above.
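For reference, here is a minimal sketch of how the function name and numeric arguments can be pulled out of those two onclick values with plain Perl (core only, no WWW::Scripter needed). parse_onclick is a hypothetical helper of my own, not part of any module, and it assumes the arguments are always plain integers, as they are in the snippet above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): split an onclick attribute
# into the JavaScript function name and its integer arguments.
sub parse_onclick {
    my ($attr) = @_;
    return unless defined $attr;

    # Optional "javascript:" prefix, a function name, then a parenthesised
    # argument list, e.g. "javascript:displayAllPayPosition(442111,3918381);"
    my ($name, $args) = $attr =~ /^\s*(?:javascript:)?\s*(\w+)\s*\(([^)]*)\)/
        or return;

    # Assumption: every argument is a plain integer.
    my @args = $args =~ /(\d+)/g;
    return { name => $name, args => \@args };
}

# The two attribute values copied from the page source above.
my @onclick_attrs = (
    'displayAllEntrants(442111,3918381,3);',
    'javascript:displayAllPayPosition(442111,3918381);',
);

for my $attr (@onclick_attrs) {
    my $call = parse_onclick($attr) or next;
    print "$call->{name}: @{ $call->{args} }\n";
}
```

Run as-is, this prints "displayAllEntrants: 442111 3918381 3" and "displayAllPayPosition: 442111 3918381". It only identifies which handler a link would call and with what IDs; it does not execute any JavaScript.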
The buttons do NOT appear within a <form> element. Methods such as click_button() and click($button) in Perl's WWW::Scripter module (and its parent, WWW::Mechanize) seem to assume the button is part of a form (mine aren't), judging by their CPAN docs. At any rate, I've tried them anyway, without success so far.
When I get the anchor tag DOM elements by using $scripter->document->getElementsByTagName('a'), I'm able to find the two elements I'm interested in (the buttons with the onclick attribute I'm looking for). But when I call HTML::DOM::Element's click() method on those elements, I don't get the content I need. It seems to be OK that the button is not in a form, but the CPAN docs seem to say click() is only supported for HTML5.
So now what? (My code snippet is below.) Thanks in advance.
============
My code snippet:
use WWW::Scripter;
# ...
$url = "http://www.foo.com/stats/xyz/"; # my target site.
$w = new WWW::Scripter;
$w->use_plugin('Ajax');
$content = $w->get($url);
# @div_elems = $w->document->getElementsByTagName('div');
@anchor_elems = $w->document->getElementsByTagName('a');
foreach $anchor_elem (@anchor_elems) {
$click_attr = $anchor_elem->getAttribute('onclick');
if ( ($click_attr =~ m/displayAllPayPosition/) || ($click_attr =~ m/displayAllEntrants/) ) {
# We're finding this attr and getting here.
$click_result = $anchor_elem->click(); # Only supported for HTML5?
print "CLICK RESULT: ", $click_result, "\n"; # Issue: shows nothing.
# So I also tried:
$response_info = $anchor_elem->trigger_event('onclick');
print "RESP INFO: ", $response_info, "\n";
$resp_content = $response_info->content;
print "RESP CONTENT: ", $resp_content, "\n";
# TODO: or could we use $click_resp = $w->click($button_name); ??
# This would return an HTTP::Response object, then use $click_resp->content
# to get at the content. But what is our $button_name? We're not in a form.
# Don't think we can use it. Other options??
}
}
exit;