chronicdose has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys, I am a new Perl programmer. I started this summer working for a company writing web crawlers that parse data from various sites. When I first ran into websites with JavaScript I found various workarounds, basically doing what the JS did without using the JS. My current issue is with an ASP.NET website. I was reading up on various tools I could use and I began to work with HTML::TreeBuilderX::ASP_NET. Other modules that I have been using are WWW::Mechanize, LWP::UserAgent and HTML::TokeParser. The doPostBack JS methods were too complicated for me to understand well enough to simply replicate their actions. The problem I am currently running into is that there are two separate kinds of links used on the site. The first is an <input type=image ..>, and it is simple enough to grab the content behind that link.

    my @inputs = $mech->find_all_inputs(
        type       => 'image',
        name_regex => qr/$pattern1/,
    );
    # the Pattern is simply a name that is unique to all the buttons I want to access
    foreach my $i (@inputs) {
        my $temp = $i->name();
        $mech->click_button(name => $temp);
        $tempContent = $mech->content;
        # this is another function using tokeparser to grab info from the page linked by the images content
        &getDetails($tempContent);
        # variable to grab the current url, for future use.
        $goToMoreDetails = $mech->uri();
        $mech->back();    # returning to original page.
    }

This code works fine. The problem is that I need to go to the next page, which has a new list of <input type="image"...> links. The "button" that does this is a hyperlink with a doPostBack, using an img (not an INPUT of type IMAGE) as the clickable link.

    <a title="Next Page" href="javascript:__doPostBack(&#39;ctl00$ContentBody$CtrlNotice$grdItems$ctl00$ctl03$ctl01$ctl26&#39;,&#39;&#39;)">
      <img title="Next Page" class="image2" src="/Images/Icons/next_16.gif" alt="Next Page" style="border-width:0px;" />
    </a>

Using the HTML::TreeBuilderX::ASP_NET module I wrote the following code to handle this.

    my $resp = $mech->response();
    my $root = HTML::TreeBuilder->new_from_content( $resp->content );

    # The next part is to grab the link element, it is a hack job, I wasn't able
    # to get both tag-> a and title eq 'Next Page' in one line, which would be cleaner.
    my @a_tags = $root->look_down( '_tag', 'a' );
    foreach my $atag (@a_tags) {
        my $temp = $atag->as_HTML;
        if ($temp =~ 'title="Next Page"') {
            $a = $atag;
        }
    }

    # This is code from the CPAN website for the module
    # It was noted to use an ->httpResponse, which doesn't exist
    # Since the response is the result of the request I have replaced it with
    my $aspnet = HTML::TreeBuilderX::ASP_NET->new({
        element => $a,
        baseURL => $mech->uri,    ## takes into account posting redirects
    });
    my $response = $mech->request($aspnet->httpRequest);
    my $content  = $response->content;
    print $content;    # I wanted to see if I got the proper html content

This code only grabs the current page I was on without going to the next page for some reason. So what I tried was actually sending the concatenated string created by the asp_net module like this

    my $content = $mech->get($aspnet->httpRequest->as_string);
    print $content;

Passing the URL as a string is how I would normally use get, i.e. $mech->get("http://www.google.ca"); however, THIS is what results in the error. The "string" is too large for the GET request. Is there any way I can extend the GET request's maximum length so I can pass in the entire string, or is there something simple I am missing here to get the next page's content? Thanks in advance to anyone who looks at this. Liam

Replies are listed 'Best First'.
Re: "Request URI Too Large (The size of the required header is too large...."
by Anonymous Monk on Jun 28, 2011 at 14:54 UTC
    It says __doPostBack, not __doGetBack. POST is not GET, and POST does not have the same limits as GET.
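
To make the POST point concrete, here is a minimal sketch of what a __doPostBack link stands for: the JavaScript copies its two arguments into the hidden form fields __EVENTTARGET and __EVENTARGUMENT and then submits the page's form by POST, so the data travels in the request body rather than in the URL. The href below is the one from the question. The form_encode helper and the __VIEWSTATE value are made up for illustration; a real page supplies its own hidden fields, and all of them have to be sent back. The sketch assumes ASCII values and uses only core Perl.

```perl
use strict;
use warnings;

# href copied from the "Next Page" link in the question (entities still encoded)
my $href = 'javascript:__doPostBack(&#39;ctl00$ContentBody$CtrlNotice$grdItems$ctl00$ctl03$ctl01$ctl26&#39;,&#39;&#39;)';

# Decode numeric entities such as &#39; into the characters they stand for
(my $js = $href) =~ s/&#(\d+);/chr($1)/ge;

# Pull the two arguments out of the __doPostBack('target','argument') call
my ($target, $argument) = $js =~ /__doPostBack\('([^']*)','([^']*)'\)/
    or die "no __doPostBack call found in href";

# Hypothetical helper: percent-encode an ASCII value for a
# application/x-www-form-urlencoded body
sub form_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-_.~])/sprintf('%%%02X', ord $1)/ge;
    return $text;
}

# Fields the form would submit. The __VIEWSTATE value is a placeholder;
# the real one must be scraped from the page being paged through.
my @fields = (
    [ '__EVENTTARGET',   $target ],
    [ '__EVENTARGUMENT', $argument ],
    [ '__VIEWSTATE',     'PLACEHOLDER_VIEWSTATE' ],
);

# Join name=value pairs with '&' to form the POST body
my @pairs;
for my $field (@fields) {
    my ($name, $value) = @$field;
    push @pairs, form_encode($name) . '=' . form_encode($value);
}
my $body = join '&', @pairs;

print "$body\n";
```

The resulting string belongs in the body of a POST to the form's action URL. Putting it in a GET URL is what runs into the length limit from the question.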

      Thanks, very good point... I rewrote the code to make a post request so I could pass it into my $mech

      # List page is the BASE page from which I am starting
      my $aspContent = $aspnet->httpRequest->content;
      my $req = HTTP::Request->new(POST => "$LIST_PAGE");
      $req->content_type('application/x-www-form-urlencoded');
      $req->content("$aspContent");
      my $res = $mech->request($req);
      if ($res->is_success) {
          $content = $res->content;
      }
      print $content;

      So what I assume is that this should be sent like :
      POST
      Host: $LIST_PAGE
      User-Agent: Mozilla/5.1
      Content-Type: application/x-www-form-urlencoded
      Content: With the content of the asp post here.
      This runs, but it still returns the base page. So I am back where I started :(

        If your company is in the business of doing web automation, consider asking your colleagues about Wireshark or whatever other protocol sniffer they use to analyze the network traffic.

        Whenever your browser is behaving differently from your Perl script, that means that the browser is sending data to the server that is different from the data your Perl script sends. Find out and remove the difference, and the server will treat you just like it treats the browser.
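
One way to "find the difference" once both requests have been captured is to compare the POST bodies field by field. The following is only a sketch in core Perl: the two body strings are invented stand-ins for a body copied out of a sniffer capture and a body logged by the script, and parse_form_body is a hypothetical helper that keeps only the last value if a name repeats. Headers and cookies can differ too and are not covered here.

```perl
use strict;
use warnings;

# Hypothetical helper: parse an application/x-www-form-urlencoded body
# into a name => value hash (a repeated name keeps its last value)
sub parse_form_body {
    my ($body) = @_;
    my %fields;
    for my $pair (split /&/, $body) {
        my ($name, $value) = split /=/, $pair, 2;
        $value = '' unless defined $value;
        for ($name, $value) {
            tr/+/ /;                               # '+' encodes a space
            s/%([0-9A-Fa-f]{2})/chr(hex $1)/ge;    # undo percent-encoding
        }
        $fields{$name} = $value;
    }
    return %fields;
}

# Invented example bodies: one as sent by the browser, one as sent by the script
my %browser = parse_form_body('__EVENTTARGET=ctl26&__EVENTARGUMENT=&__VIEWSTATE=abc%3D&__EVENTVALIDATION=xyz');
my %script  = parse_form_body('__EVENTTARGET=ctl26&__EVENTARGUMENT=&__VIEWSTATE=abc%3D');

# Collect every field name seen on either side
my %names;
$names{$_} = 1 for keys %browser, keys %script;

# Report fields that are missing, extra, or carry a different value
my @report;
for my $name (sort keys %names) {
    if    (!exists $script{$name})             { push @report, "missing from script: $name" }
    elsif (!exists $browser{$name})            { push @report, "extra in script: $name" }
    elsif ($browser{$name} ne $script{$name})  { push @report, "differs: $name" }
}

print "$_\n" for @report;
```

With these invented inputs the only line reported is the field the script never sent. In a real capture each reported line is a candidate for why the server answers the script differently.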

        Also see WWW::Scripter and/or WWW::Mechanize::Firefox.