zuma53 has asked for the wisdom of the Perl Monks concerning the following question:

Hi--

I am trying to grab a page with the following code:
use LWP::Simple;
use LWP::UserAgent;

$browser = LWP::UserAgent->new;
$browser->default_headers->push_header('User-Agent' => 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; iOpus-I-M; GTB6; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648; InfoPath.2; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)');
$browser->default_headers->push_header('Host' => "brtweb.phila.gov");
$browser->default_headers->push_header('Connection' => "close");
$response = $browser->get('http://brtweb.phila.gov/brt.apps/Search/SearchResults.aspx?id=6546003202');
print $response->content;
Nothing fancy or anything.

Yet when I pull down the page with the above code, I get most, but not all, of the page that a browser gets. Principally, I am missing 2 chunks that are part of a 3-part tabbed view on the page (in this case, Perl only gets "Account Information", but not the "Account Details" or "Property Valuation" tabs). They never show up from a Perl request, and I do not see another sub-request being made to pull the data down.

I have tried sending every header needed, as shown by a trace from www.rexswain.com (Thanks!), and even that site is able to fetch the entire page.

Any ideas as to what I am not providing/doing wrong here?

Thanks!

Replies are listed 'Best First'.
Re: lwp not retrieving the same page as from a browser
by moritz (Cardinal) on Aug 25, 2009 at 07:42 UTC
    That page uses JavaScript to display the search results - maybe that's causing you grief.

    Disable JavaScript in your browser and see if what you get in the browser matches what you get with LWP.

Re: lwp not retrieving the same page as from a browser
by james2vegas (Chaplain) on Aug 25, 2009 at 07:53 UTC
    Define 'pull down the page'. When I connect to that page without running JavaScript I see no data, but if I look at the source, some (perhaps all) of it is there. The JavaScript hides and unhides the various data parts. What are you using to parse the HTML?
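A quick way to check what is actually in the source, independent of any parser or of JavaScript hiding, is to search the raw HTML for the three tab labels named in the question. The sketch below assumes those labels appear verbatim as text in the page; `missing_sections` and the sample string are hypothetical names used for illustration, not part of LWP or the site.

```perl
use strict;
use warnings;

# Tab labels as named in the original question. Assumption: they occur
# verbatim in the HTML source of the page.
my @TAB_LABELS = ('Account Information', 'Account Details', 'Property Valuation');

# Return the labels that do not occur anywhere in the given HTML string.
sub missing_sections {
    my ($html, @labels) = @_;
    return grep { index($html, $_) < 0 } @labels;
}

# Hypothetical stand-in for a response body; in practice this would be
# the content returned by the GET request.
my $sample  = '<div>Account Information</div><div>Property Valuation</div>';
my @missing = missing_sections($sample, @TAB_LABELS);
print "Missing: @missing\n";
```

Running the same check on the page saved from the browser and on the LWP output would show whether the sections are absent from the response or only hidden on display.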
      pull down the page = what is returned from the GET request

      I turned off JavaScript in the browser and the 'missing' data is still present in the returned page (i.e. it has no effect on what gets returned; the browser still gets more data than Perl does).

      I guess the best way I can describe this is:

      Browser:
        Headers + Get => AxyzB

      Perl:
        Headers + Get => AxB

      where A, B, x, y and z are sections of the returned HTML; x, y and z are the sections associated with the tabbed areas.

      I am sending the same headers in Perl (as far as I know) that were sent/shown via rexswain.com.
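To take the guesswork out of "as far as I know", it may help to print the request LWP builds before it goes out. The sketch below uses `prepare_request` from LWP::UserAgent to apply the default headers to a request object without contacting the server; `request_preview` is a hypothetical helper name. Headers that LWP adds at the connection level may not appear in this output, so treat it as a partial view rather than a wire trace.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# Build a GET request, let the user agent apply its default headers,
# and return the result as text. No network traffic is involved.
sub request_preview {
    my ($ua, $url) = @_;
    my $request = HTTP::Request->new(GET => $url);
    $ua->prepare_request($request);
    return $request->as_string;
}

# Same header setup as in the original question, so the output shows
# how push_header interacts with LWP's own default User-Agent value.
my $ua = LWP::UserAgent->new;
$ua->default_headers->push_header('User-Agent' => 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)');
$ua->default_headers->push_header('Connection' => 'close');

print request_preview($ua, 'http://brtweb.phila.gov/brt.apps/Search/SearchResults.aspx?id=6546003202');
```

The User-Agent string here is abbreviated from the one in the question. Comparing the printed headers line by line with the rexswain.com trace should show any difference in what is being sent.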

        If I change your code to this (changing your User-Agent to the one used by rexswain.com), and using the normal call to set user-agent, viz:
        use LWP::Simple;
        use LWP::UserAgent;

        $browser = LWP::UserAgent->new();
        $browser->agent('Mozilla/5.0 (X11; U; OpenBSD i386; en-US; rv:1.8.1.22) Gecko/20090626 SeaMonkey/1.1.17 XpcomViewer/0.9');
        $response = $browser->get('http://brtweb.phila.gov/brt.apps/Search/SearchResults.aspx?id=6546003202');
        print $response->content;

        I then get the same number of lines and the same amount of text as rexswain.com does. I have not verified the content; can you check? Using your User-Agent string returns a 41437-byte response, while the rexswain User-Agent (used above) returns 43314 bytes, which is the same as the rexswain.com form returns. Perhaps sending Mozilla/4.0 instead of 5.0 was triggering some code path in their ASP code that you would not see otherwise.
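The size comparison above can be repeated with a small script that fetches the same URL under each User-Agent string and reports the body length. This is only a sketch: `fetch_length` and `sizes_differ` are hypothetical helper names, the URL is taken from the command line, and the byte counts you see will depend on what the server returns at the time.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# The two User-Agent strings discussed in this thread.
my %agents = (
    msie7     => 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; iOpus-I-M; GTB6; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648; InfoPath.2; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)',
    seamonkey => 'Mozilla/5.0 (X11; U; OpenBSD i386; en-US; rv:1.8.1.22) Gecko/20090626 SeaMonkey/1.1.17 XpcomViewer/0.9',
);

# Fetch $url with the given User-Agent; return the body length in bytes,
# or undef if the request did not succeed.
sub fetch_length {
    my ($agent, $url) = @_;
    my $ua       = LWP::UserAgent->new(agent => $agent, timeout => 30);
    my $response = $ua->get($url);
    return $response->is_success ? length($response->content) : undef;
}

# True (1) if the given lengths are not all identical, else 0.
sub sizes_differ {
    my %seen = map { $_ => 1 } @_;
    return keys(%seen) > 1 ? 1 : 0;
}

# Only touch the network when a URL is supplied on the command line.
if (defined(my $url = shift @ARGV)) {
    my @lengths;
    for my $name (sort keys %agents) {
        my $len = fetch_length($agents{$name}, $url);
        print "$name: ", (defined $len ? "$len bytes" : 'request failed'), "\n";
        push @lengths, $len if defined $len;
    }
    print "Response size depends on User-Agent\n" if sizes_differ(@lengths);
}
```

Invoked with the search URL from the question as its only argument, it prints one line per User-Agent. A difference in length only shows that the server responds differently; the content still needs to be checked by eye.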