lwp not retrieving the same page as from a browser

zuma53 has asked for the wisdom of the Perl Monks concerning the following question:

Hi--

I am trying to grab a page with the following code:

use LWP::Simple;
use LWP::UserAgent;

$browser = LWP::UserAgent->new;

$browser->default_headers->push_header('User-Agent' => 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; iOpus-I-M; GTB6; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648; InfoPath.2; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)');
$browser->default_headers->push_header('Host' => "brtweb.phila.gov");
$browser->default_headers->push_header('Connection' => "close");

$response = $browser->get('http://brtweb.phila.gov/brt.apps/Search/SearchResults.aspx?id=6546003202');
print $response->content;

Nothing fancy or anything.

Yet when I pull down the page with the above code, I get most, but not all, of the page that a browser returns. Principally, I am missing two chunks that are part of a three-part tabbed view on the page: Perl gets only "Account Information", not the "Account Details" or "Property Valuation" tabs. They never show up in the response to a Perl request, and I do not see another sub-request being made to pull the data down.

I have tried sending every header needed, as shown by a trace from www.rexswain.com (Thanks!) and even that site is able to fetch the entire page.

Any ideas as to what I am not providing/doing wrong here?

Thanks!

Re: lwp not retrieving the same page as from a browser by moritz (Cardinal) on Aug 25, 2009 at 07:42 UTC
That page uses JavaScript to display the search results; maybe that's causing you grief. Disable JavaScript in your browser and see if what you get in the browser matches what you get with LWP.
Re: lwp not retrieving the same page as from a browser by james2vegas (Chaplain) on Aug 25, 2009 at 07:53 UTC
Define 'pull down the page'. When I connect to that page without running JavaScript I see no data, but if I look at the source, some (perhaps all) of it is there. The JavaScript hides and unhides the various data parts. What are you using to parse the HTML?
Re^2: lwp not retrieving the same page as from a browser by zuma53 (Beadle) on Aug 26, 2009 at 07:15 UTC
pull down the page = what is returned from the GET request.

I turned off JavaScript in the browser and the 'missing' data is still present in the returned page (i.e. it has no effect on what gets returned; the browser still gets more data than Perl does).

I guess the best way I can describe this is:

Browser: Headers + GET => AxyzB
Perl: Headers + GET => AxB

where A, B, x, y and z are sections of the HTML returned; x, y and z are the sections associated with the tabbed areas.

I am sending the same headers in Perl (as far as I know) that were sent/shown via rexswain.com.
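One way to pin down exactly which sections differ is to compare the two responses line by line. The following is only a minimal sketch, not something posted in the thread: the helper name missing_lines is hypothetical, and the sample strings are stand-ins shaped like the AxyzB / AxB description above rather than real page content.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the lines that appear
# in $full but not in $partial, in their original order.
sub missing_lines {
    my ($full, $partial) = @_;

    # Index every non-blank line of the partial response, ignoring
    # leading and trailing whitespace.
    my %seen;
    for my $line (split /\r?\n/, $partial) {
        $line =~ s/^\s+|\s+$//g;
        $seen{$line} = 1 if length $line;
    }

    # Collect the lines of the full response that were never seen.
    my @missing;
    for my $line (split /\r?\n/, $full) {
        $line =~ s/^\s+|\s+$//g;
        next unless length $line;
        push @missing, $line unless $seen{$line};
    }
    return @missing;
}

# Stand-in data (assumption): one "section" per line.
my $browser_html = join "\n", qw(A x y z B);
my $lwp_html     = join "\n", qw(A x B);

print "$_\n" for missing_lines($browser_html, $lwp_html);
```

In practice the first argument would be the page source saved from the browser and the second the body printed by the LWP script. A plain line comparison like this ignores ordering and duplicate lines, so it is a rough first check rather than a real diff.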
Re^3: lwp not retrieving the same page as from a browser by james2vegas (Chaplain) on Aug 26, 2009 at 08:01 UTC
If I change your code to this (changing your User-Agent to the one used by rexswain.com, and using the normal call to set the user agent), viz:

use LWP::Simple;
use LWP::UserAgent;

$browser = LWP::UserAgent->new();
$browser->agent('Mozilla/5.0 (X11; U; OpenBSD i386; en-US; rv:1.8.1.22) Gecko/20090626 SeaMonkey/1.1.17 XpcomViewer/0.9');
$response = $browser->get('http://brtweb.phila.gov/brt.apps/Search/SearchResults.aspx?id=6546003202');
print $response->content;

I then get the same number of lines and the same amount of text as rexswain.com does. I have not verified the content; can you check? Using your User-Agent string returns a 41437-byte response, and the rexswain User-Agent (used above) returns 43314 bytes, which is the same as the rexswain.com form returns. Perhaps sending Mozilla/4.0 instead of 5.0 was triggering some code path in their ASP code that you would not see otherwise.
Re^4: lwp not retrieving the same page as from a browser by zuma53 (Beadle) on Aug 26, 2009 at 19:46 UTC