PerlMonks  

LWP POST to a form on a 'secondary web page'

by ady (Deacon)
on Dec 25, 2006 at 18:15 UTC

ady has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks,

I have a web server on our intranet (Win2k running MS IIS).
I want to access data on a web site (.NET aspx) on this server.

I must first fill in a form on the site's primary page, setting the field 'miljoe' to the value 'UDV', which sends 'GET /RSData.aspx?miljoe=UDV' to the server. This fetches a secondary page containing another form.

I must then repeatedly enter data into some fields on this second form (e.g. name 'TextBoxProductID' = value 'KMD.NI.DPSagsbehandler', and name 'Button1' = value 'Opdater filter') and POST it to fetch logging info from a DB. The server renders that info into an HTML table on the page, from which I can retrieve it.

I have previously used Win32::IE::Mechanize to navigate the pages on this server, but in the current release I want to try LWP to accomplish the same task.

My problem is that I can't get LWP to navigate to the form on the second page and enter data into its fields. Can LWP only be used for shallow screen scraping of directly addressable web pages, and not for deeper navigation and extraction?

server: http://rswatch
page1:  /RSData.aspx              # form with field 'miljoe'
page2:  /RSData.aspx?miljoe=UDV   # form with field 'TextBoxProductID' and button 'Button1'

Here's a trace of my POST to the server; it doesn't work.

POST /RSData.aspx?miljoe=UDV HTTP/1.1
TE: deflate,gzip;q=0.3
Connection: TE
Authorization: Basic S01EXHo2YW5kOno2YW5keXl5
Host: rswatch
User-Agent: libwww-perl/5.805
Content-Length: 151
Content-Type: application/x-www-form-urlencoded

DropDownListType=-TextBoxGUID&-=TextBoxUserName&-=TextBoxKommunenr&-=TextBoxProductID&KMD.NI.DPSagsbehandler=TextBoxShortText&-=Button1&Opdater+filter=

HTTP/1.1 200 OK
Date: Mon, 25 Dec 2006 14:25:36 GMT
Server: Microsoft-IIS/6.0
MicrosoftOfficeWebServer: 5.0_Pub
X-Powered-By: ASP.NET
X-AspNet-Version: 1.1.4322
Set-Cookie: ASP.NET_SessionId=4dbkrgn4idtdwbuptotubemu; path=/
Cache-Control: private
Content-Type: text/html; charset=utf-8
Content-Length: 11883

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" >
<HTML>
  <HEAD>
    <span id="Label2"><title>RSWatch - UDV</title></span>
    <meta content="Microsoft Visual Studio .NET 7.1" name="GENERATOR">
    <meta content="C#" name="CODE_LANGUAGE">
    <meta content="JavaScript" name="vs_defaultClientScript">
    <meta content="http://schemas.microsoft.com/intellisense/ie5" name="vs_targetSchema">
    <LINK href="StyleSheet1.css" type="text/css" rel="stylesheet">
  </HEAD>
  <body>
    <center>
      <table class="BodyTable">
        <tr>
          <td class="TDheaderUnderline"><A href="default.aspx">RSWatch</A> - <span id="Label1">UDV</span><a name="top">&nbsp;</a></td>
        </tr>
        <tr>
          <td class="BodyTable">
            <form name="Form1" method="post" action="RSData.aspx?miljoe=UDV" id="Form1">
<input type="hidden" name="__VIEWSTATE" value="dDwxMTA1MDg5NDkzO3Q8O2w8aTwxPjtpPDM+Oz47bDx0PHA8cDxsPFRleHQ7PjtsPFw8dGl0bGVcPlJTV2F0Y2ggLSBVRFZcPC90aXRsZVw+Oz4+Oz47Oz47dDxwPHA8bDxUZXh0Oz47bDxVRFY7Pj47Pjs7Pjs+Pjs+QtBaNAQOnC4Eqk2prlcPA4K8wqw=" />
              <table class="noborder">
                <tr>
                  <td class="noborder"><a id="HyperLink2" title="forrige" href="/RSData.aspx?Miljoe=UDV&amp;StartFejllogId=3118741"><--</a>&nbsp;&nbsp;&nbsp;
                    <a id="HyperLink1" title="n..ste" href="/RSData.aspx?Miljoe=UDV&amp;StartFejllogId=3118781">--></a>
                    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a id="HyperLink3" title="til top" href="/RSData.aspx?Miljoe=UDV&amp;StartFejllogId=2147483647">-->></a></td>
                  <td class="noborder">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</td>
                  <td class="noborder"><select name="DropDownListType" id="DropDownListType">
                    <option selected="selected" value="-">-</option>
                    <option value="E">E</option>
                    <option value="S">S</option>
                    <option value="W">W</option>
                    <option value="R">R</option>
                    <option value="T">T</option>
                  </select></td>
                  <td class="noborder"><input name="TextBoxGUID" type="text" value="-" id="TextBoxGUID" /></td>
                  <td class="noborder"><input name="TextBoxUserName" type="text" value="-" id="TextBoxUserName" /></td>
                  <td class="noborder"><input name="TextBoxKommunenr" type="text" value="-" id="TextBoxKommunenr" /></td>
                  <td class="noborder"><input name="TextBoxProductID" type="text" value="-" id="TextBoxProductID" /></td>
                  <td class="noborder"><input name="TextBoxShortText" type="text" value="-" id="TextBoxShortText" /></td>
                  <td class="noborder"><input type="submit" name="Button1" value="Opdater filter" id="Button1" /></td>
                </tr>
              </table>
            </form>
            <!--table content cut out here -->
  </body>
</HTML>

And here's a trace of manually entering the desired data and posting from a browser.

POST /RSData.aspx?miljoe=UDV HTTP/1.1
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, application/x-shockwave-flash, */*
Referer: http://rswatch/RSData.aspx?miljoe=UDV
Accept-Language: da
Content-Type: application/x-www-form-urlencoded
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; InfoPath.1)
Host: rswatch
Content-Length: 0
Connection: Keep-Alive
Cache-Control: no-cache
Cookie: ASP.NET_SessionId=0i4zi0q0uag51lypzvg4m0va
Authorization: Negotiate TlRMTVNTUAABAAAAB4IIogAAAAAAAAAAAAAAAAAAAAAFASgKAAAAD0==

HTTP/1.1 401 Unauthorized
Content-Length: 83
Content-Type: text/html
Server: Microsoft-IIS/6.0
WWW-Authenticate: Negotiate TlRMTVNTUAACAAAABgAGADgAAAAFgomixPDhPomZ5sYAAAAAAAAAAI4AjgA+AAAABQLODgAAAA9LAE0ARAACAAYASwBNAEQAAQAQAE8ARABTAFcARQBCADAAMQAEABoAaQBuAHQAZQByAG4ALgBrAG0AZAAuAGQAawADACwATwBEAFMAVwBFAEIAMAAxAC4AaQBuAHQAZQByAG4ALgBrAG0AZAAuAGQAawAFABoAaQBuAHQAZQByAG4ALgBrAG0AZAAuAGQAawAAAAAA
MicrosoftOfficeWebServer: 5.0_Pub
X-Powered-By: ASP.NET
Date: Mon, 25 Dec 2006 17:43:04 GMT

<html><head><title>Error</title></head><body>Error: Access is Denied.</body></html>

POST /RSData.aspx?miljoe=UDV HTTP/1.1
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, application/x-shockwave-flash, */*
Referer: http://rswatch/RSData.aspx?miljoe=UDV
Accept-Language: da
Content-Type: application/x-www-form-urlencoded
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; InfoPath.1)
Host: rswatch
Content-Length: 368
Connection: Keep-Alive
Cache-Control: no-cache
Cookie: ASP.NET_SessionId=0i4zi0q0uag51lypzvg4m0va
Authorization: Negotiate TlRMTVNTUAADAAAAGAAYAGQAAAAYABgAfAAAAAYABgBIAAAACgAKAE4AAAAMAAwAWAAAAAAAAACUAAAABYKIogUBKAoAAAAPSwBNAEQAWgA2AEEATgBEAEgAMgA0ADkANgA0AG2gazXZgVp0AAAAAAAAAAAAAAAAAAAAAC4sefx6XWUzFigAY3IxHngpT+49JULFTA==

__VIEWSTATE=dDwxMTA1MDg5NDkzO3Q8O2w8aTwxPjtpPDM%2BOz47bDx0PHA8cDxsPFRleHQ7PjtsPFw8dGl0bGVcPlJTV2F0Y2ggLSBVRFZcPC90aXRsZVw%2BOz4%2BOz47Oz47dDxwPHA8bDxUZXh0Oz47bDxVRFY7Pj47Pjs7Pjs%2BPjs%2BQtBaNAQOnC4Eqk2prlcPA4K8wqw%3D&DropDownListType=-&TextBoxGUID=-&TextBoxUserName=-&TextBoxKommunenr=-&TextBoxProductID=-&TextBoxShortText=KMD.NI.DPSagsbehandler&Button1=Opdater+filter

HTTP/1.1 200 OK
Date: Mon, 25 Dec 2006 17:43:18 GMT
Server: Microsoft-IIS/6.0
MicrosoftOfficeWebServer: 5.0_Pub
X-Powered-By: ASP.NET
X-AspNet-Version: 1.1.4322
Cache-Control: private
Content-Type: text/html; charset=utf-8
Content-Length: 11823

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" >
<HTML>
  <HEAD>
    <span id="Label2"><title>RSWatch - UDV</title></span>
    <meta content="Microsoft Visual Studio .NET 7.1" name="GENERATOR">
    <meta content="C#" name="CODE_LANGUAGE">
    <meta content="JavaScript" name="vs_defaultClientScript">
    <meta content="http://schemas.microsoft.com/intellisense/ie5" name="vs_targetSchema">
    <LINK href="StyleSheet1.css" type="text/css" rel="stylesheet">
  </HEAD>
  <body>
    <center>
      <table class="BodyTable">
        <tr>
          <td class="TDheaderUnderline"><A href="default.aspx">RSWatch</A> - <span id="Label1">UDV</span><a name="top">&nbsp;</a></td>
        </tr>
        <tr>
          <td class="BodyTable">
            <form name="Form1" method="post" action="RSData.aspx?miljoe=UDV" id="Form1">
<input type="hidden" name="__VIEWSTATE" value="dDwxMTA1MDg5NDkzO3Q8O2w8aTwxPjtpPDM+Oz47bDx0PHA8cDxsPFRleHQ7PjtsPFw8dGl0bGVcPlJTV2F0Y2ggLSBVRFZcPC90aXRsZVw+Oz4+Oz47Oz47dDxwPHA8bDxUZXh0Oz47bDxVRFY7Pj47Pjs7Pjs+Pjs+QtBaNAQOnC4Eqk2prlcPA4K8wqw=" />
              <table class="noborder">
                <tr>
                  <td class="noborder"><a id="HyperLink2" title="forrige" href="/RSData.aspx?Miljoe=UDV&amp;StartFejllogId=2769544"><--</a>&nbsp;&nbsp;&nbsp;
                    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a id="HyperLink3" title="til top"
                    href="/RSData.aspx?Miljoe=UDV&amp;StartFejllogId=2147483647">-->></a></td>
                  <td class="noborder">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</td>
                  <td class="noborder"><select name="DropDownListType" id="DropDownListType">
                    <option selected="selected" value="-">-</option>
                    <option value="E">E</option>
                    <option value="S">S</option>
                    <option value="W">W</option>
                    <option value="R">R</option>
                    <option value="T">T</option>
                  </select></td>
                  <td class="noborder"><input name="TextBoxGUID" type="text" value="-" id="TextBoxGUID" style="width:150px;" /></td>
                  <td class="noborder"><input name="TextBoxUserName" type="text" value="-" id="TextBoxUserName" style="width:80px;" /></td>
                  <td class="noborder"><input name="TextBoxKommunenr" type="text" value="-" id="TextBoxKommunenr" style="width:80px;" /></td>
                  <td class="noborder"><input name="TextBoxProductID" type="text" value="-" id="TextBoxProductID" style="width:100px;" /></td>
                  <td class="noborder"><input name="TextBoxShortText" type="text" value="KMD.NI.DPSagsbehandler" id="TextBoxShortText" /></td>
                  <td class="noborder"><input type="submit" name="Button1" value="Opdater filter" id="Button1" /></td>
                </tr>
              </table>
            </form>
            <!--table content cut out here -->
  </body>
</HTML>

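The two request bodies can be compared field by field. The following is only a minimal sketch, assuming the URI module is installed; the helper name field_names is made up for illustration, and the long __VIEWSTATE value from the browser trace is replaced by the placeholder VIEWSTATE_PLACEHOLDER. It decodes both bodies and lists the field names that the browser sent but the LWP request did not.

```perl
use strict;
use warnings;
use URI;

# Request body sent by LWP, copied from the first trace.
my $lwp_body = 'DropDownListType=-TextBoxGUID&-=TextBoxUserName&-=TextBoxKommunenr'
             . '&-=TextBoxProductID&KMD.NI.DPSagsbehandler=TextBoxShortText'
             . '&-=Button1&Opdater+filter=';

# Request body sent by the browser. ASSUMPTION: the real __VIEWSTATE value
# is replaced by a short placeholder to keep the example readable.
my $browser_body = '__VIEWSTATE=VIEWSTATE_PLACEHOLDER&DropDownListType=-&TextBoxGUID=-'
                 . '&TextBoxUserName=-&TextBoxKommunenr=-&TextBoxProductID=-'
                 . '&TextBoxShortText=KMD.NI.DPSagsbehandler&Button1=Opdater+filter';

# Decode a urlencoded body and return its field names in order.
# (Hypothetical helper, not part of the original script.)
sub field_names {
    my ($body) = @_;
    my $uri = URI->new('http://rswatch/RSData.aspx');
    $uri->query($body);
    my @pairs = $uri->query_form;    # flat list: name, value, name, value, ...
    my @names;
    while ( my ($name, $value) = splice(@pairs, 0, 2) ) {
        push @names, $name;
    }
    return @names;
}

my %sent_by_lwp = map { $_ => 1 } field_names($lwp_body);
my @missing     = grep { !$sent_by_lwp{$_} } field_names($browser_body);

print "Names sent by LWP:     ", join(', ', field_names($lwp_body)), "\n";
print "Missing from LWP body: ", join(', ', @missing), "\n";
```

Read this way, the LWP body appears to have names and values shifted by one position (the value 'KMD.NI.DPSagsbehandler' shows up as a field name), and it carries no __VIEWSTATE field at all, whereas the browser sends the hidden __VIEWSTATE value back with every POST.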
Any hints and explanations much appreciated
Best Regards
allan dystrup
Update

Here's the code relevant to the question. Only do_POST, do_RSbase and do_page actually matter for the LWP navigation.

### Arg parsing, Initialization, IO setup cut out here...

### ======================================================================
### do_POST -- Params:
###   the URL, (odsweb01.kmd.dk [172.31.88.103]: http://rswatch/RSData.aspx)
###   an arrayref or hashref for the key/value pairs,
###   optionally: any header lines: (key,value, key,value)
### ======================================================================
sub do_POST {
    if ( ! $ua ) {
        $ua = new LWP::UserAgent(keep_alive=>1,parse_head=>0);
        $ua->credentials('rswatch:80', 'rswatch', "KMD\\z6and", 'xxxxxxx');
        $ua->default_header('Referer' => "http:\/\/rswatch\/RSData.aspx?miljoe=$args{E}");
        $ua->default_header('Accept-Language' => 'da');
        push @{$ua->requests_redirectable}, 'POST';
        $ua->cookie_jar( {} );
        $ua->env_proxy();
    }
    my $resp = $ua->post(@_);
    return ($resp->content, $resp->status_line, $resp->is_success, $resp) if wantarray;
    return unless $resp->is_success;
    return $resp->content;
}

### ======================================================================
### do_RSbase : Parse RSwatch DB by traversing <-- ('forrige') link chain
### ======================================================================
### Termination: sub not_interesting :'$done' when ($S < $args{T}), cf.
###              sub set_args : $tw = "20051103151100"; # 1.log date
sub do_RSbase {
    # Start in Browsing mode
    $browsing = 1; print "Browsing page:\n";

    # Parse 1.st and previous pages, until done
    my $previous = "http://rswatch/RSData.aspx?miljoe=UDV";
    for (my $p = 1; !$done; ) {
        print ">" . $p++ . "\n";
        usleep ($args{S});                 # Pause and...
        $previous = do_page($previous);    # parse previous page.
    }
}

### ======================================================================
### do_page : Parse RSWatch page
### ======================================================================
sub do_page {
    # --- Fetch page (1.page & back-links)
    my $url = shift;
    my @parms = [];
=cut # this doesn't work...
    my @parms = [
        'TextBoxProductID'=> 'KMD.NI.DPSagsbehandler',
        'Button1' => 'Opdater filter',
    ];
=cut
    my ($content, $message, $is_success) = do_POST("$url", @parms);
    die "***ERROR: HTTP to $url:\r\n\t$message\n" unless $is_success;
    #print "$content\n\n";

    # --- Decode & Parse page
    my $root = HTML::TreeBuilder->new;
    $content = decode("utf8", $content);
    $root->parse($content);

    # --- Extract page backlink
    my $node_prev = $root->find_by_attribute("id", "HyperLink2");
    my $link_prev = $node_prev->attr("href");

    # --- Process main log table
    my @tables = $root->find_by_tag_name('table');
    my @table_rows = $tables[2]->find_by_tag_name('tr');
    do_summary(\@table_rows);

    # --- Free parse resources
    $root->eof; #$root->dump;
    $root->delete;

    # --- Return link to previous page
    return "http://rswatch/" . $link_prev; # or 0, if last page!
}

### ======================================================================
### do_summary : Parse RSWatch log summary table
### ======================================================================
### ----------------------------------------------------------------------
### Raise flags: !browsing if past -f(rom); $done if past -t(o).
sub not_interesting {
    my $r_table_cells = shift;
    my @table_cells = @{$r_table_cells};
    my $S = ($table_cells[4]->as_text);
    $S =~ s/[-: ]//g;
    if ($S > $args{F}) { $browsing ||=1; return 1;}    # Before from.. skip
    if ($args{T} > $S) { $done = 1; return 1;}         # After to... quit
    if ($browsing) { $browsing = 0; print "\n"; }      # 0: Interesting!
    return;
}

### ----------------------------------------------------------------------
### Parse each log $row to @log_record table on page
sub do_summary {
    my $r_table_rows = shift;              # ref param
    my @table_rows = @{$r_table_rows};     # cast to array
    shift(@table_rows);                    # discard header row
    ROW: # --- Process each <ProductID> $row to @log_record
    foreach my $row (@table_rows) {
        return if $done;
        my @log_record;
        my @table_cells = $row->find_by_tag_name('td');
        if ( exists($table_cells[5]) &&
             $table_cells[5]->as_text=~/DPSagsbehandler/i ) # TODO:read from config
        {
            # --- If interesting: build @log_record from HTML
            next ROW if not_interesting(\@table_cells); # Skip out-of-bounds
            foreach my $cell (@table_cells) { push @log_record, $cell->as_text; }

            # --- If E(rror): process row details and push on @log_record
            my $type = $table_cells[1]->as_text; # [E(rror)|S|W|R|T]
            if ($type =~ /E/i) {
                my $detail_link = "http://rswatch/" .
                    $table_cells[0]->find_by_tag_name('a')->attr('href');
                my $details = do_details($detail_link);
                push @log_record, $details;
            }
            # --- Reformat and print @log_record to file (tee to STDOUT)
            print_record(\@log_record);
        }
    }
}

### ======================================================================
### do_details : Parse RSWatch details
### ======================================================================
sub do_details {
    # --- Fetch details page for $url
    my $url = shift;
    my ($content, $message, $is_success) = do_POST("$url", []);
    die "***ERROR: POST to $url:\r\n\t$message\n" unless $is_success;

    # --- Decode & Parse page
    my $root = HTML::TreeBuilder->new;
    $content = decode("utf8", $content);
    $root->parse($content);

    # --- Retrieve details text
    my @tables = $root->find_by_tag_name('table');
    my @table_rows = $tables[3]->find_by_tag_name('tr');
    shift (@table_rows); # discard table header
    my $details = $table_rows[0]->find_by_attribute("valign", "top")->as_text();

    # --- Free parse resources
    $root->eof; #$root->dump; #print "\tSUMMARY: $url\n";
    $root->delete;
    return $details;
}

### ======================================================================
### print_record : Print one log record
### ======================================================================
sub print_record {
    my $r_log_record = shift;              # ref param
    my @log_record = @{$r_log_record};     # cast to array

    # --- Reformat log record
    $log_record[4] =~ s/ /#/;                       # separate date,time in TimeStamp
    for my $i (1..2) { shift(@log_record); }        # discard FejllogId & Type
    my @print_record;
    push @print_record, split('#', $log_record[2]); # TimeStamp date and time
    push @print_record, "<TYPE>";                   # DPxxx -- Fill in
    push @print_record, $log_record[1];             # Municipality No.
    push @print_record, $log_record[0];             # User ID

    # --- Parse ShortText
    ### TO-BE-DONE ###
    push @print_record, "<S[EX][OF]";               # Service Entry|eXit,Ok|False
    push @print_record, $log_record[5];             # ShortText

    # --- Print record
    @print_record = map { "$_," } @print_record;    # To CSV format...
    my $print_record = "@print_record";             # - flatten
    $print_record =~ s/\s*//g;                      # - zap whitespace
    print $t "$print_record\n";                     # - tee out!
}

### ======================================================================
### MAIN
### ======================================================================
### Init
set_args();
$t1 = time(); print scalar localtime,"\n";
initialize();

### Extract
do_RSbase();

### Cleanup
flock(OF,LOCK_UN); close(OF);
$t2 = time(); print "\n", scalar localtime,"\n";
my ($h,$m,$s) = (localtime($t2-$t1))[2,1,0];
print "Elapsed: $m:$s\n";
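
To see what the commented-out @parms list in do_page would put on the wire, the request can be built without sending it. This is a rough sketch, assuming HTTP::Request::Common is installed (LWP::UserAgent's post method builds its request through that module's POST function). Only the two field names and the URL come from the code above; the rest is illustration.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Build (but do not send) the request that $ua->post($url, \@fields)
# would issue for the two fields from the commented-out @parms list.
my $url = 'http://rswatch/RSData.aspx?miljoe=UDV';
my $req = POST $url, [
    'TextBoxProductID' => 'KMD.NI.DPSagsbehandler',
    'Button1'          => 'Opdater filter',
];

print $req->method, ' ', $req->uri, "\n";
print $req->header('Content-Type'), "\n";
print $req->content, "\n";

# Only the pairs listed above end up in the body; nothing from the page's
# form (such as the hidden __VIEWSTATE input) is added automatically.
my $has_viewstate = $req->content =~ /__VIEWSTATE=/ ? 1 : 0;
print "contains __VIEWSTATE: $has_viewstate\n";
```

The body built this way holds exactly the two pairs passed in, so it differs from the browser's POST, which also returns __VIEWSTATE and the remaining text boxes. Whether the missing hidden field is what makes the server ignore the filter is a guess that the traces suggest but do not prove.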

Replies are listed 'Best First'.
Re: LWP POST to a form on a 'secondary web page'
by andyford (Curate) on Dec 25, 2006 at 19:02 UTC

    Can LWP only be used for shallow screen scraping of directly addressable web pages, and not for deeper navigation and extraction?
    LWP should do the job. The only place I've gotten stuck using LWP was when the web server is using NTLM authentication.
    LWP has some support for NTLM, but it seems to not work with newer implementations.

    You will need to post some code to get good help though.

    non-Perl: Andy Ford

      Tnx Andy,
      I got the NTLM authorization to work;
      I'm trying LWP partly out of curiosity and partly because I expect it to be faster than Win32::IE::Mechanize.
      I've updated the node with the relevant code.
      It runs as is and does what I want, but it's not as efficient as it could be: the POST of the field values to the 'second page' doesn't work, so the database returns all rows (i.e. it doesn't filter by the fields in the form on the page).
      Best regards,
      allan
Re: LWP POST to a form on a 'secondary web page'
by ForgotPasswordAgain (Priest) on Dec 25, 2006 at 18:54 UTC
    I see no code. You probably need to use cookie_jar and/or set the referer. I'm not sure why you wouldn't use WWW::Mechanize, instead of raw LWP::UserAgent, though.
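
One way to follow the WWW::Mechanize suggestion while staying close to plain LWP is to let HTML::Form, the form parser that WWW::Mechanize itself builds on, read the second page's form and produce the POST. The sketch below is an untested illustration under several assumptions: HTML::Form is installed, the form markup is a trimmed copy of the one in the traces, and the __VIEWSTATE value is a short placeholder rather than the real one. It runs offline and only builds the request.

```perl
use strict;
use warnings;
use HTML::Form;

# Trimmed copy of the form from the traced page. ASSUMPTION: the __VIEWSTATE
# value is a placeholder; a real run would parse the HTML of a fresh response.
my $html = <<'HTML';
<form name="Form1" method="post" action="RSData.aspx?miljoe=UDV" id="Form1">
<input type="hidden" name="__VIEWSTATE" value="VIEWSTATE_PLACEHOLDER" />
<select name="DropDownListType" id="DropDownListType">
  <option selected="selected" value="-">-</option>
  <option value="E">E</option>
</select>
<input name="TextBoxGUID" type="text" value="-" id="TextBoxGUID" />
<input name="TextBoxUserName" type="text" value="-" id="TextBoxUserName" />
<input name="TextBoxKommunenr" type="text" value="-" id="TextBoxKommunenr" />
<input name="TextBoxProductID" type="text" value="-" id="TextBoxProductID" />
<input name="TextBoxShortText" type="text" value="-" id="TextBoxShortText" />
<input type="submit" name="Button1" value="Opdater filter" id="Button1" />
</form>
HTML

# Parse the form relative to the URL the page was fetched from.
my ($form) = HTML::Form->parse($html, 'http://rswatch/RSData.aspx?miljoe=UDV');

# Fill in the filter field; every other input keeps the value from the page.
$form->value('TextBoxProductID', 'KMD.NI.DPSagsbehandler');

# click() returns an HTTP::Request for pressing the named submit button.
my $req = $form->click('Button1');

print $req->method, ' ', $req->uri, "\n";
print $req->content, "\n";

# With a live user agent the request could then be sent with
# $ua->request($req), reusing the same cookie jar and credentials.
```

Because the request is generated from the page's own form, the hidden __VIEWSTATE input and the untouched text boxes are sent along with the changed field, which is the shape of the browser's POST in the second trace. In a loop over the 'forrige' links, the form would have to be re-parsed from each new response so that the current __VIEWSTATE is returned.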
