Greetings Monks,
I have a web server on our intranet (Win2k running MS IIS).
I want to access data on a web site (.NET aspx) on this server.
I must first enter data into a form on the site's primary page, setting the field 'miljoe' to the value 'UDV', and do a 'GET /RSData.aspx?miljoe=UDV' to the server. This fetches a secondary page containing another form.
I must then repeatedly enter data into some fields (i.e. name 'TextBoxProductID' = value 'KMD.NI.DPSagsbehandler', and name 'Button1' = value 'Opdater filter') on this second form and POST it to fetch logging info from a DB, which the server enters into an HTML table on the page, from where I can retrieve it.
I have previously used Win32::IE::Mechanize to navigate the pages on this server, but in the current release I want to try LWP to accomplish the same task.
My problem is that I can't get LWP to navigate to the form on the second page and enter data into its fields. Can LWP only be used for shallow screen scraping of directly addressable web pages, and not for deeper navigation and extraction?
server: http://rswatch
page1: /RSData.aspx # form with field 'miljoe'
page2: /RSData.aspx?miljoe=UDV # form with field 'TextBoxProductID' and button 'Button1'
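The two-step navigation described above can be sketched with HTML::Form, which ships alongside LWP. This is a minimal sketch, not a tested solution for this server: it assumes the second page's form can be identified by its 'Button1' submit input, and the helper name build_filter_request is made up for illustration. The point it shows is that parsing the fetched page keeps every input of the form, including hidden ones such as __VIEWSTATE, which ASP.NET pages generally expect to be posted back.

```perl
use strict;
use warnings;
use HTML::Form;

# Build the POST request for the filter form from the HTML of page 2.
# $html and $base_url come from a previous GET; %$fields holds the values
# to overwrite. Hidden inputs (e.g. __VIEWSTATE) are carried along because
# HTML::Form keeps every input it parsed from the page.
# (build_filter_request is a hypothetical helper name.)
sub build_filter_request {
    my ($html, $base_url, $fields) = @_;

    # Pick the first form that contains the 'Button1' submit input.
    my ($form) = grep { defined $_->find_input('Button1') }
                 HTML::Form->parse($html, $base_url);
    die "filter form not found\n" unless $form;

    # Overwrite only the fields the caller asked for.
    $form->value($_ => $fields->{$_}) for keys %$fields;

    return $form->click('Button1');    # an HTTP::Request, not yet sent
}

# Hypothetical usage against the live server (needs a configured $ua):
#   my $resp = $ua->get('http://rswatch/RSData.aspx?miljoe=UDV');
#   my $req  = build_filter_request($resp->decoded_content, $resp->base,
#                  { TextBoxProductID => 'KMD.NI.DPSagsbehandler' });
#   my $page = $ua->request($req);
```

Building the request from the parsed form, rather than from a hand-written field list, means the names, default values and hidden fields stay in step with whatever the server rendered.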
Here's a trace of my POST to the server. It doesn't work.
POST /RSData.aspx?miljoe=UDV HTTP/1.1
TE: deflate,gzip;q=0.3
Connection: TE
Authorization: Basic S01EXHo2YW5kOno2YW5keXl5
Host: rswatch
User-Agent: libwww-perl/5.805
Content-Length: 151
Content-Type: application/x-www-form-urlencoded
DropDownListType=-TextBoxGUID&-=TextBoxUserName&-=TextBoxKommunenr&-=TextBoxProductID&KMD.NI.DPSagsbehandler=TextBoxShortText&-=Button1&Opdater+filter=
HTTP/1.1 200 OK
Date: Mon, 25 Dec 2006 14:25:36 GMT
Server: Microsoft-IIS/6.0
MicrosoftOfficeWebServer: 5.0_Pub
X-Powered-By: ASP.NET
X-AspNet-Version: 1.1.4322
Set-Cookie: ASP.NET_SessionId=4dbkrgn4idtdwbuptotubemu; path=/
Cache-Control: private
Content-Type: text/html; charset=utf-8
Content-Length: 11883
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" >
<HTML>
<HEAD>
<span id="Label2"><title>RSWatch - UDV</title></span>
<meta content="Microsoft Visual Studio .NET 7.1" name="GENERAT
+OR">
<meta content="C#" name="CODE_LANGUAGE">
<meta content="JavaScript" name="vs_defaultClientScript">
<meta content="http://schemas.microsoft.com/intellisense/ie5"
+name="vs_targetSchema">
<LINK href="StyleSheet1.css" type="text/css" rel="stylesheet">
</HEAD>
<body>
<center>
<table class="BodyTable">
<tr>
<td class="TDheaderUnderline"><A href="default.asp
+x">RSWatch</A> -
<span id="Label1">UDV</span><a name="top">&nbs
+p;</a></td>
</tr>
<tr>
<td class="BodyTable">
<form name="Form1" method="post" action="RSDat
+a.aspx?miljoe=UDV" id="Form1">
<input type="hidden" name="__VIEWSTATE" value="dDwxMTA1MDg5NDkzO3Q8O2w
+8aTwxPjtpPDM+Oz47bDx0PHA8cDxsPFRleHQ7PjtsPFw8dGl0bGVcPlJTV2F0Y2ggLSBV
+RFZcPC90aXRsZVw+Oz4+Oz47Oz47dDxwPHA8bDxUZXh0Oz47bDxVRFY7Pj47Pjs7Pjs+P
+js+QtBaNAQOnC4Eqk2prlcPA4K8wqw=" />
<table class="noborder">
<tr>
<td class="noborder"><a id="HyperL
+ink2" title="forrige" href="/RSData.aspx?Miljoe=UDV&StartFejllogI
+d=3118741"><--</a>
<a id="HyperLink1" title="n..s
+te" href="/RSData.aspx?Miljoe=UDV&StartFejllogId=3118781">--></a>
+
<a id="HyperLink3" title="til
+top" href="/RSData.aspx?Miljoe=UDV&StartFejllogId=2147483647">-->
+></a></td>
<td class="noborder"> &
+nbsp; </td>
<td class="noborder"><select name=
+"DropDownListType" id="DropDownListType">
<option selected="selected" value="-">-</option>
<option value="E">E</option>
<option value="S">S</option>
<option value="W">W</option>
<option value="R">R</option>
<option value="T">T</option>
</select></td>
<td class="noborder"><input name="
+TextBoxGUID" type="text" value="-" id="TextBoxGUID" /></td>
<td class="noborder"><input name="
+TextBoxUserName" type="text" value="-" id="TextBoxUserName" /></td>
<td class="noborder"><input name="
+TextBoxKommunenr" type="text" value="-" id="TextBoxKommunenr" /></td>
<td class="noborder"><input name="
+TextBoxProductID" type="text" value="-" id="TextBoxProductID" /></td>
<td class="noborder"><input name="
+TextBoxShortText" type="text" value="-" id="TextBoxShortText" /></td>
<td class="noborder"><input type="
+submit" name="Button1" value="Opdater filter" id="Button1" /></td>
</tr>
</table>
</form>
<!--table content cut out here -->
</body>
</HTML>
And here's a trace of manually entering the desired data and posting from a browser.
POST /RSData.aspx?miljoe=UDV HTTP/1.1
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, applicati
+on/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, a
+pplication/x-shockwave-flash, */*
Referer: http://rswatch/RSData.aspx?miljoe=UDV
Accept-Language: da
Content-Type: application/x-www-form-urlencoded
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .N
+ET CLR 1.1.4322; InfoPath.1)
Host: rswatch
Content-Length: 0
Connection: Keep-Alive
Cache-Control: no-cache
Cookie: ASP.NET_SessionId=0i4zi0q0uag51lypzvg4m0va
Authorization: Negotiate TlRMTVNTUAABAAAAB4IIogAAAAAAAAAAAAAAAAAAAAAFA
+SgKAAAAD0==
HTTP/1.1 401 Unauthorized
Content-Length: 83
Content-Type: text/html
Server: Microsoft-IIS/6.0
WWW-Authenticate: Negotiate TlRMTVNTUAACAAAABgAGADgAAAAFgomixPDhPomZ5s
+YAAAAAAAAAAI4AjgA+AAAABQLODgAAAA9LAE0ARAACAAYASwBNAEQAAQAQAE8ARABTAFc
+ARQBCADAAMQAEABoAaQBuAHQAZQByAG4ALgBrAG0AZAAuAGQAawADACwATwBEAFMAVwBF
+AEIAMAAxAC4AaQBuAHQAZQByAG4ALgBrAG0AZAAuAGQAawAFABoAaQBuAHQAZQByAG4AL
+gBrAG0AZAAuAGQAawAAAAAA
MicrosoftOfficeWebServer: 5.0_Pub
X-Powered-By: ASP.NET
Date: Mon, 25 Dec 2006 17:43:04 GMT
<html><head><title>Error</title></head><body>Error: Access is Denied.<
+/body></html>
POST /RSData.aspx?miljoe=UDV HTTP/1.1
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, applicati
+on/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, a
+pplication/x-shockwave-flash, */*
Referer: http://rswatch/RSData.aspx?miljoe=UDV
Accept-Language: da
Content-Type: application/x-www-form-urlencoded
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .N
+ET CLR 1.1.4322; InfoPath.1)
Host: rswatch
Content-Length: 368
Connection: Keep-Alive
Cache-Control: no-cache
Cookie: ASP.NET_SessionId=0i4zi0q0uag51lypzvg4m0va
Authorization: Negotiate TlRMTVNTUAADAAAAGAAYAGQAAAAYABgAfAAAAAYABgBIA
+AAACgAKAE4AAAAMAAwAWAAAAAAAAACUAAAABYKIogUBKAoAAAAPSwBNAEQAWgA2AEEATg
+BEAEgAMgA0ADkANgA0AG2gazXZgVp0AAAAAAAAAAAAAAAAAAAAAC4sefx6XWUzFigAY3I
+xHngpT+49JULFTA==
__VIEWSTATE=dDwxMTA1MDg5NDkzO3Q8O2w8aTwxPjtpPDM%2BOz47bDx0PHA8cDxsPFRleHQ7PjtsPFw8dGl0bGVcPlJTV2F0Y2ggLSBVRFZcPC90aXRsZVw%2BOz4%2BOz47Oz47dDxwPHA8bDxUZXh0Oz47bDxVRFY7Pj47Pjs7Pjs%2BPjs%2BQtBaNAQOnC4Eqk2prlcPA4K8wqw%3D&DropDownListType=-&TextBoxGUID=-&TextBoxUserName=-&TextBoxKommunenr=-&TextBoxProductID=-&TextBoxShortText=KMD.NI.DPSagsbehandler&Button1=Opdater+filter
HTTP/1.1 200 OK
Date: Mon, 25 Dec 2006 17:43:18 GMT
Server: Microsoft-IIS/6.0
MicrosoftOfficeWebServer: 5.0_Pub
X-Powered-By: ASP.NET
X-AspNet-Version: 1.1.4322
Cache-Control: private
Content-Type: text/html; charset=utf-8
Content-Length: 11823
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" >
<HTML>
<HEAD>
<span id="Label2"><title>RSWatch - UDV</title></span>
<meta content="Microsoft Visual Studio .NET 7.1" name="GENERAT
+OR">
<meta content="C#" name="CODE_LANGUAGE">
<meta content="JavaScript" name="vs_defaultClientScript">
<meta content="http://schemas.microsoft.com/intellisense/ie5"
+name="vs_targetSchema">
<LINK href="StyleSheet1.css" type="text/css" rel="stylesheet">
</HEAD>
<body>
<center>
<table class="BodyTable">
<tr>
<td class="TDheaderUnderline"><A href="default.asp
+x">RSWatch</A> -
<span id="Label1">UDV</span><a name="top">&nbs
+p;</a></td>
</tr>
<tr>
<td class="BodyTable">
<form name="Form1" method="post" action="RSDat
+a.aspx?miljoe=UDV" id="Form1">
<input type="hidden" name="__VIEWSTATE" value="dDwxMTA1MDg5NDkzO3Q8O2w
+8aTwxPjtpPDM+Oz47bDx0PHA8cDxsPFRleHQ7PjtsPFw8dGl0bGVcPlJTV2F0Y2ggLSBV
+RFZcPC90aXRsZVw+Oz4+Oz47Oz47dDxwPHA8bDxUZXh0Oz47bDxVRFY7Pj47Pjs7Pjs+P
+js+QtBaNAQOnC4Eqk2prlcPA4K8wqw=" />
<table class="noborder">
<tr>
<td class="noborder"><a id="HyperL
+ink2" title="forrige" href="/RSData.aspx?Miljoe=UDV&StartFejllogI
+d=2769544"><--</a>
<a id="HyperLink3" title="til
+top" href="/RSData.aspx?Miljoe=UDV&StartFejllogId=2147483647">-->
+></a></td>
<td class="noborder"> &
+nbsp; </td>
<td class="noborder"><select name=
+"DropDownListType" id="DropDownListType">
<option selected="selected" value="-">-</option>
<option value="E">E</option>
<option value="S">S</option>
<option value="W">W</option>
<option value="R">R</option>
<option value="T">T</option>
</select></td>
<td class="noborder"><input name="
+TextBoxGUID" type="text" value="-" id="TextBoxGUID" style="width:150p
+x;" /></td>
<td class="noborder"><input name="
+TextBoxUserName" type="text" value="-" id="TextBoxUserName" style="wi
+dth:80px;" /></td>
<td class="noborder"><input name="
+TextBoxKommunenr" type="text" value="-" id="TextBoxKommunenr" style="
+width:80px;" /></td>
<td class="noborder"><input name="
+TextBoxProductID" type="text" value="-" id="TextBoxProductID" style="
+width:100px;" /></td>
<td class="noborder"><input name="
+TextBoxShortText" type="text" value="KMD.NI.DPSagsbehandler" id="Text
+BoxShortText" /></td>
<td class="noborder"><input type="
+submit" name="Button1" value="Opdater filter" id="Button1" /></td>
</tr>
</table>
</form>
<!--table content cut out here -->
</body>
</HTML>
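To compare the two POST bodies in the traces above, it can help to split them into name/value pairs. The sketch below uses only core Perl; decode_form_body is a name made up for this illustration. Run on the body from the LWP trace, it suggests that names and values have slipped out of step (for example '-' appears as a field name) and that no __VIEWSTATE field is sent, whereas the browser body starts with __VIEWSTATE.

```perl
use strict;
use warnings;

# Split an application/x-www-form-urlencoded body into [name, value] pairs.
# (decode_form_body is a hypothetical helper name.)
sub decode_form_body {
    my ($body) = @_;
    my @pairs;
    for my $chunk (split /&/, $body) {
        my ($name, $value) = split /=/, $chunk, 2;
        $value = '' unless defined $value;
        for ($name, $value) {
            tr/+/ /;                              # '+' encodes a space
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;  # undo %XX escapes
        }
        push @pairs, [$name, $value];
    }
    return @pairs;
}

# Body copied from the LWP trace (wrapped lines rejoined).
my $lwp_body = 'DropDownListType=-TextBoxGUID&-=TextBoxUserName&-=TextBoxKommunenr'
             . '&-=TextBoxProductID&KMD.NI.DPSagsbehandler=TextBoxShortText'
             . '&-=Button1&Opdater+filter=';

# Print one "name => value" line per pair.
printf "%-24s => %s\n", @$_ for decode_form_body($lwp_body);
```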
Any hints and explanations much appreciated
Best Regards
allan dystrup
Update
Here's the basic code of relevance to the question - actually only do_POST, do_RSbase and do_page are of importance to the LWP navigation.
### Arg parsing, Initialization, IO setup cut out here...
### ======================================================================
### do_POST -- Params:
###   the URL, (odsweb01.kmd.dk [172.31.88.103]: http://rswatch/RSData.aspx)
###   an arrayref or hashref for the key/value pairs,
###   optionally: any header lines: (key,value, key,value)
### ======================================================================
sub do_POST {
    # Create and configure the shared user agent on first use.
    if ( !$ua ) {
        $ua = LWP::UserAgent->new( keep_alive => 1, parse_head => 0 );
        $ua->credentials( 'rswatch:80', 'rswatch', "KMD\\z6and", 'xxxxxxx' );
        $ua->default_header( 'Referer' => "http://rswatch/RSData.aspx?miljoe=$args{E}" );
        $ua->default_header( 'Accept-Language' => 'da' );
        push @{ $ua->requests_redirectable }, 'POST';
        $ua->cookie_jar( {} );
        $ua->env_proxy();
    }
    my $resp = $ua->post(@_);
    # List context: content, status line, success flag and the response object.
    return ( $resp->content, $resp->status_line, $resp->is_success, $resp )
        if wantarray;
    # Scalar context: content on success, undef otherwise.
    return unless $resp->is_success;
    return $resp->content;
}
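Since do_POST hands its arguments straight to $ua->post, the form data has to be a flat list of name => value pairs inside one array reference. The sketch below builds such a request offline with HTTP::Request::Common (the module LWP uses to construct POST requests), so the encoded body can be inspected without touching the server. preview_post is a made-up helper name, and the preview omits whatever the user agent adds later, such as default headers, cookies and credentials.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Build, but do not send, a form POST for the given name => value pairs.
# (preview_post is a hypothetical helper name.)
sub preview_post {
    my ($url, $pairs) = @_;
    return POST($url, $pairs);    # returns an HTTP::Request
}

my $req = preview_post(
    'http://rswatch/RSData.aspx?miljoe=UDV',
    [ TextBoxProductID => 'KMD.NI.DPSagsbehandler', Button1 => 'Opdater filter' ],
);

# Show the urlencoded body exactly as it would go on the wire.
print $req->content, "\n";
```

Printing the body this way makes a swapped name/value order or a missing field visible before any request is made.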
### ======================================================================
### do_RSbase : Parse RSwatch DB by traversing <-- ('forrige') link chain
### ======================================================================
### Termination: sub not_interesting : '$done' when ($S < $args{T}), cf.
### sub set_args : $tw = "20051103151100"; # 1.log date
sub do_RSbase {
    # Start in browsing mode
    $browsing = 1;
    print "Browsing page:\n";
    # Parse the first and each previous page, until done
    my $previous = "http://rswatch/RSData.aspx?miljoe=UDV";
    for ( my $p = 1; !$done; ) {
        print ">" . $p++ . "\n";
        usleep( $args{S} );                # Pause and...
        $previous = do_page($previous);    # ...parse previous page.
    }
}
### ======================================================================
### do_page : Parse RSWatch page
### ======================================================================
sub do_page {
    # --- Fetch page (first page & back-links)
    my $url = shift;
    my @parms = [];    # one-element list holding an empty arrayref (no form fields)
    # This variant, which sends the filter fields, doesn't work...
    # my @parms = [
    #     'TextBoxProductID' => 'KMD.NI.DPSagsbehandler',
    #     'Button1'          => 'Opdater filter',
    # ];
    my ( $content, $message, $is_success ) = do_POST( "$url", @parms );
    die "***ERROR: HTTP to $url:\r\n\t$message\n" unless $is_success;
    #print "$content\n\n";
    # --- Decode & parse page
    my $root = HTML::TreeBuilder->new;
    $content = decode( "utf8", $content );
    $root->parse($content);
    $root->eof;        # signal end of input before querying the tree
    # --- Extract page backlink
    my $node_prev = $root->find_by_attribute( "id", "HyperLink2" );
    my $link_prev = $node_prev->attr("href");
    # --- Process main log table
    my @tables     = $root->find_by_tag_name('table');
    my @table_rows = $tables[2]->find_by_tag_name('tr');
    do_summary( \@table_rows );
    # --- Free parse resources
    #$root->dump;
    $root->delete;
    # --- Return link to previous page
    return "http://rswatch/" . $link_prev;    # or 0, if last page!
}
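The back-link hrefs in the traces start with a slash, so prefixing them with "http://rswatch/" yields a double slash. A small sketch of resolving an href against the page URL with the URI module (installed with LWP) follows; absolute_link is a made-up helper name, and 'Details.aspx?id=1' in the note below is a hypothetical relative link used only to show the second case.

```perl
use strict;
use warnings;
use URI;

# Resolve an href found on a page against that page's own URL.
# (absolute_link is a hypothetical helper name.)
sub absolute_link {
    my ($href, $base) = @_;
    return URI->new_abs($href, $base)->as_string;
}

print absolute_link(
    '/RSData.aspx?Miljoe=UDV&StartFejllogId=3118741',
    'http://rswatch/RSData.aspx?miljoe=UDV',
), "\n";
```

The same call handles hrefs without a leading slash, e.g. absolute_link('Details.aspx?id=1', 'http://rswatch/RSData.aspx'), so the caller does not need to know which form the page uses.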
### ======================================================================
### do_summary : Parse RSWatch log summary table
### ======================================================================
### ----------------------------------------------------------------------
### Raise flags: !browsing if past -f(rom); $done if past -t(o).
sub not_interesting {
    my $r_table_cells = shift;
    my @table_cells   = @{$r_table_cells};
    my $S = ( $table_cells[4]->as_text );
    $S =~ s/[-: ]//g;    # reduce the timestamp to digits only
    if ( $S > $args{F} ) { $browsing ||= 1; return 1; }    # Before from... skip
    if ( $args{T} > $S ) { $done = 1; return 1; }          # After to... quit
    if ($browsing) { $browsing = 0; print "\n"; }          # 0: Interesting!
    return;
}
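The window test in not_interesting can be illustrated in isolation. The sketch below assumes timestamps of the form "2006-12-25 14:25:36" (the actual cell format is not shown in the post) and that rows arrive newest first, which is what following the 'forrige' links implies. stamp_to_number and classify_row are names made up for this example.

```perl
use strict;
use warnings;

# Turn "2006-12-25 14:25:36" into the sortable number 20061225142536.
# (Assumed timestamp format; hypothetical helper name.)
sub stamp_to_number {
    my ($stamp) = @_;
    (my $digits = $stamp) =~ s/[-: ]//g;
    return $digits;
}

# Rows arrive newest first: 'skip' while newer than $from,
# 'done' once older than $to, otherwise 'keep'.
sub classify_row {
    my ($stamp, $from, $to) = @_;
    my $s = stamp_to_number($stamp);
    return 'skip' if $s > $from;
    return 'done' if $s < $to;
    return 'keep';
}

# Bounds in the same digits-only form as $tw in set_args.
print classify_row('2006-12-25 14:25:36', '20061231000000', '20051103151100'), "\n";
```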
### ----------------------------------------------------------------------
### Parse each log $row to @log_record table on page
sub do_summary {
    my $r_table_rows = shift;                # ref param
    my @table_rows   = @{$r_table_rows};     # cast to array
    shift(@table_rows);                      # discard header row
    # --- Process each <ProductID> $row to @log_record
  ROW:
    foreach my $row (@table_rows) {
        return if $done;
        my @log_record;
        my @table_cells = $row->find_by_tag_name('td');
        if ( exists( $table_cells[5] )
            && $table_cells[5]->as_text =~ /DPSagsbehandler/i )    # TODO: read from config
        {
            # --- If interesting: build @log_record from HTML
            next ROW if not_interesting( \@table_cells );    # Skip out-of-bounds
            foreach my $cell (@table_cells) { push @log_record, $cell->as_text; }
            # --- If E(rror): process row details and push on @log_record
            my $type = $table_cells[1]->as_text;    # [E(rror)|S|W|R|T]
            if ( $type =~ /E/i ) {
                my $detail_link = "http://rswatch/"
                    . $table_cells[0]->find_by_tag_name('a')->attr('href');
                my $details = do_details($detail_link);
                push @log_record, $details;
            }
            # --- Reformat and print @log_record to file (tee to STDOUT)
            print_record( \@log_record );
        }
    }
}
### ======================================================================
### do_details : Parse RSWatch details
### ======================================================================
sub do_details {
    # --- Fetch details page for $url
    my $url = shift;
    my ( $content, $message, $is_success ) = do_POST( "$url", [] );
    die "***ERROR: POST to $url:\r\n\t$message\n" unless $is_success;
    # --- Decode & parse page
    my $root = HTML::TreeBuilder->new;
    $content = decode( "utf8", $content );
    $root->parse($content);
    $root->eof;        # signal end of input before querying the tree
    # --- Retrieve details text
    my @tables     = $root->find_by_tag_name('table');
    my @table_rows = $tables[3]->find_by_tag_name('tr');
    shift(@table_rows);    # discard table header
    my $details = $table_rows[0]->find_by_attribute( "valign", "top" )->as_text();
    # --- Free parse resources
    #$root->dump;
    #print "\tSUMMARY: $url\n";
    $root->delete;
    return $details;
}
### ======================================================================
### print_record : Print one log record
### ======================================================================
sub print_record {
    my $r_log_record = shift;                   # ref param
    my @log_record   = @{$r_log_record};        # cast to array
    # --- Reformat log record
    $log_record[4] =~ s/ /#/;                   # separate date,time in TimeStamp
    for my $i ( 1 .. 2 ) { shift(@log_record); }    # discard FejllogId & Type
    my @print_record;
    push @print_record, split( '#', $log_record[2] );    # TimeStamp date and time
    push @print_record, "<TYPE>";               # DPxxx -- Fill in
    push @print_record, $log_record[1];         # Municipality No.
    push @print_record, $log_record[0];         # User ID
    # --- Parse ShortText
    ### TO-BE-DONE ###
    push @print_record, "<S[EX][OF]";           # Service Entry|eXit, Ok|False
    push @print_record, $log_record[5];         # ShortText
    # --- Print record
    @print_record = map { "$_," } @print_record;    # To CSV format...
    my $print_record = "@print_record";             # - flatten
    $print_record =~ s/\s*//g;                      # - zap whitespace
    print $t "$print_record\n";                     # - tee out!
}
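The CSV step above appends a comma to every field, so each output line ends in a trailing comma. If that is not wanted, a minimal alternative is sketched below: it strips whitespace from each field, as the original does for the whole line, and then joins the fields. csv_line is a made-up name, and like the original it does no quoting, so it assumes no field contains a comma.

```perl
use strict;
use warnings;

# Join fields into one comma-separated line, removing all whitespace
# inside each field first. No quoting is done (assumes comma-free fields).
# (csv_line is a hypothetical helper name.)
sub csv_line {
    my @fields = @_;
    s/\s+//g for @fields;    # zap whitespace in the local copies
    return join ',', @fields;
}

print csv_line('2006-12-25', '14:25:36', '<TYPE>', ' 101 ', 'z6 and'), "\n";
```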
### ======================================================================
### MAIN
### ======================================================================
### Init
set_args();
$t1 = time(); print scalar localtime, "\n";
initialize();
### Extract
do_RSbase();
### Cleanup
flock( OF, LOCK_UN );
close(OF);
$t2 = time(); print "\n", scalar localtime, "\n";
# gmtime, not localtime: $t2-$t1 is a duration, so no time-zone offset applies
my ( $h, $m, $s ) = ( gmtime( $t2 - $t1 ) )[ 2, 1, 0 ];
print "Elapsed: $m:$s\n";
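The elapsed-time report can also be computed with plain integer arithmetic, which avoids treating a duration as a calendar time and keeps the hours that "Elapsed: $m:$s" drops. A minimal sketch; format_elapsed is a made-up name.

```perl
use strict;
use warnings;

# Format a duration in seconds as H:MM:SS.
# (format_elapsed is a hypothetical helper name.)
sub format_elapsed {
    my ($seconds) = @_;
    my $h = int( $seconds / 3600 );
    my $m = int( ( $seconds % 3600 ) / 60 );
    my $s = $seconds % 60;
    return sprintf '%d:%02d:%02d', $h, $m, $s;
}

print "Elapsed: ", format_elapsed(3725), "\n";    # prints "Elapsed: 1:02:05"
```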