goodepic has asked for the wisdom of the Perl Monks concerning the following question:
I'm trying to scrape a large website. There are two single select drop down lists that refresh the page and populate a third single select drop down list. After selecting from this list, you click on one of 8 links below. The href of each link's tag in the page HTML is "#", and the tag has onClick="tohtm('../.*.php')". In Firefox this opens up a new page/tab and brings you to a data table whose contents I need.
I'm using WWW::Mechanize for this. I can log in through the first page of this site and follow a link to get to the page described above. Then I've tried setting the first two selects (after selecting the form they're inside by name), but that doesn't seem to work: the responses I get still have an unpopulated third drop down select control.
Luckily changing those first two selects brings you to a different URL. So I've also tried just $browser->get()ting that URL and then trying to select and submit/click from the 3rd drop down select menu. Then I've tried following the link to the data through the follow_link function, but this just brings me back to the same page, with a "#" tacked onto the URL given. I've also tried just getting the URL for the data page directly after selecting from the third drop down menu, but that gives me a page with an empty data table that isn't empty when accessed properly through the browser.
Below are some snippets from the HTML of the page I'm working with and the key lines from the code I'm trying to get to work.
<form name="sipp" method="post" target="_blank">
<input name="ses_id" type="hidden" value="sid">
...
<select name="fiscal" size="1" style="width:150" onChange="enableIt(this,document.sipp.propinsi); gatherInfothn(this, 'thn='); getval(this,document.sipp.thnang);">
<option value="0">Pilih Tahun</option>
<option value="2008" selected>2008</option>
<option value="2007">2007</option>
<option value="2006">2006</option>
<option value="2005">2005</option>
<option value="2004">2004</option>
</select>
<input type="hidden" name="thnang" value="">
...
<select name="propinsi" style="width:150" onChange="enableIt(this,document.sipp.proyektemp); gatherInfoProp(document.sipp.fiscal, this, 'thn=','&kdprop='); getName(this,document.sipp.nmpropinsi);">
<option value="0">Pilih Propinsi</option>
<option value="01" selected>DKI Jakarta </option>
<option value="02">Jawa Barat </option>
<option value="03">Jawa Tengah </option>
And the links to the data table I need look like this. Note the dots in the HTML tags are just so this shows up looking right here:
<..tr>
<..td class="namaForm">Form A-3<../td>
<..td class="content" onMouseOver="this.bgColor='#EAEAEA'" onMouseOut="this.bgColor='#FFFFFF'">
<..a href="#" onClick="tohtm('../sipp2005/form_A3.php')">Laporan Paket Kontrak<../a>
<../td>
<..td align="center" class="clickableTXT"><img src="../images/xls.gif" alt="Simpan ke file Excel dan Print" width="20" height="20" onClick="toxls('../sipp2005/form_A3.php')"><../td>
<../tr>
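Since the href is only "#", the real target lives in the onClick attribute. A minimal sketch of reading the tohtm() argument out of the markup with a regex, assuming the real page has ordinary tags (no display dots) and double-quoted attributes as in the snippet above; tohtm_targets is a made-up helper name, and this only recovers the path, it does not reproduce whatever else tohtm() does in the browser:

```perl
use strict;
use warnings;

# Hypothetical helper: map each link's text to the path passed to tohtm()
# in its onClick attribute.
sub tohtm_targets {
    my ($html) = @_;
    my %target_for;
    while ($html =~ m{<a\b[^>]*\bonClick="tohtm\('([^']+)'\)"[^>]*>(.*?)</a>}gis) {
        my ($path, $text) = ($1, $2);
        $text =~ s/^\s+|\s+$//g;    # trim whitespace around the link text
        $target_for{$text} = $path;
    }
    return %target_for;
}

# Sample link copied from the snippet above, without the display dots.
my $sample = q{<td class="content"><a href="#" onClick="tohtm('../sipp2005/form_A3.php')">Laporan Paket Kontrak</a></td>};

my %targets = tohtm_targets($sample);
print "$_ => $targets{$_}\n" for sort keys %targets;
# prints: Laporan Paket Kontrak => ../sipp2005/form_A3.php
```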
Finally, here's some of my code. This comes after I've already logged in and followed a link to the page where the HTML above comes from.
$br->get('http://.../sipp.php?thn=2008&kdprop=01');
# $br is initialized from mechanize: my $br = WWW::Mechanize->new();
# Set ->agent_alias('Windows IE 6');
my $resp = $br->content();
$resp =~ s/\x0D//g; # On a mac here, get ^M at the end of each line
my @pt_vals = get_proyektemp_values($resp);
# Don't want to use mech-dump, so just regexing the newly populated values of the 3rd drop down menu
$br->form_name('sipp');
$br->field('proyektemp', "$pt_vals[1]");
$br->submit();
Then I've tried both of the following.
#1
my $link_resp = $br->follow_link(text_regex => qr/paket\s+kontrak/i);
#2
$br->get('.../sipp2005/form_A3.php');
#1 just brings me back to the same page I started on. #2 gets me to the data table page, but with an (incorrectly) empty table. Am I just being a newbie web programmer idiot? Is this a Javascript problem? Are these select controls and links all calling Javascript functions, which aren't interpreted in Mechanize? Are there other libraries that would scrape this page successfully? I've also tried the Python version of Mechanize, but had no success there either.
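The body of get_proyektemp_values() is not shown in the code above. A minimal sketch of what such a helper might look like, assuming the third menu is a select named "proyektemp" with double-quoted value attributes like the other two menus; the option values in the sample are invented purely for illustration:

```perl
use strict;
use warnings;

# Hypothetical stand-in for get_proyektemp_values(): return the option
# values found inside <select name="proyektemp"> in the page source.
sub get_proyektemp_values {
    my ($html) = @_;
    # Isolate the one select block first so options from the other
    # menus (fiscal, propinsi) are not picked up.
    my ($block) = $html =~ m{<select\b[^>]*\bname="proyektemp"[^>]*>(.*?)</select>}is
        or return;
    return $block =~ m{<option\b[^>]*\bvalue="([^"]*)"}gi;
}

# Invented sample markup; the real option values are not shown in the post.
my $sample_page = <<'HTML';
<select name="fiscal"><option value="2008" selected>2008</option></select>
<select name="proyektemp" size="1">
<option value="0">Pilih</option>
<option value="A1">Example A1</option>
</select>
HTML

my @pt_vals = get_proyektemp_values($sample_page);
print "@pt_vals\n";    # prints: 0 A1
```

With a placeholder option first, as in the other two menus, $pt_vals[1] is the first real entry, which matches how the value is used in the code above.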
Replies are listed 'Best First'.
Re: Mechanize, Forms, Links, problem from Javascript?
by Cody Pendant (Prior) on Jun 20, 2008 at 03:27 UTC
by goodepic (Initiate) on Jun 20, 2008 at 21:21 UTC
by Cody Pendant (Prior) on Jun 22, 2008 at 06:24 UTC
by goodepic (Initiate) on Jun 24, 2008 at 22:47 UTC
by Anonymous Monk on Jun 21, 2008 at 10:35 UTC
by goodepic (Initiate) on Jun 21, 2008 at 22:50 UTC
by runrig (Abbot) on Jun 25, 2008 at 00:04 UTC
Re: Mechanize, Forms, Links, problem from Javascript?
by Anonymous Monk on Jun 20, 2008 at 03:25 UTC