get with LWP drops HTML

by jialanw (Initiate)
on Sep 30, 2008 at 04:32 UTC

jialanw has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I'm trying to write a script to download some webpages using LWP.

The problem is that the responses I'm getting are incomplete webpages: they contain only some of what I see in my normal browser, omitting seemingly random pieces of code (comments, javascript, forms, etc.). For this particular page, even a simple 'get' call shows the problem. I've tried the ->as_string and ->content methods and the ':content_file' option, but all of them have the missing code problem. Also, the content from 'get' that is saved to the file actually changes between runs of the program.

I've tried it with other websites and it seems to work - is this caused by the website I'm trying to download from? How can I get around it?

I do use a cookie to log onto the site, but I don't think that's the problem.

Here's the code:

-----------------

use LWP;

# $district holds the court district; its value is not shown in this post.
$ua  = LWP::UserAgent->new;
$res = $ua->get(
    "https://ecf.$district.uscourts.gov/cgi-bin/iquery.pl",
    ':content_file' => 'test.html',
);

-----------------

As an example of the lost code, here's the code from going to the site and using "Save as" from the browser:

<script language="JavaScript"> FirstField="case_num";</script>
<form enctype="multipart/form-data" method="post" action="/cgi-bin/iquery.pl?109027233035598-L_758_0-1">
<!--ShowPage(iquery.htm)-->
<!-- rcsid="$Header: /usr/local/cvsroot/bankruptcy/web/html/iquery.htm,v 3.6 2005/02/07 20:00:34 gamores Exp $" -->

Here's what I get from the saved content file from "get":

<SCRIPT LANGUAGE="JavaScript"> FirstField="case_num";</SCRIPT><!-ShowPage(iquery.htm)->
<!-- rcsid="$Header: /usr/local/cvsroot/bankruptcy/web/html/iquery.htm,v 3.6 2005/02/07 20:00:34 gamores Exp $" -->

Notice that the form has disappeared when using "get". Does "get" reformat the code?

Thanks!!

Replies are listed 'Best First'.
Re: get with LWP drops HTML
by NetWallah (Canon) on Sep 30, 2008 at 05:00 UTC
    The default "Useragent" for LWP is "libwww-perl/#.##".

    It is quite possible that the cgi query you are attempting to "get" sends different responses based on what it thinks the user agent is.

    Read the docs on "$ua = LWP::UserAgent->new( %options )" to see how you can specify the agent.
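
    As a minimal sketch of that suggestion: the `agent` option of the constructor sets the User-Agent header. The browser-like string below is only an example value, not something the site is known to require.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Assumption: this browser-like string is an arbitrary example value.
my $ua = LWP::UserAgent->new(
    agent => 'Mozilla/5.0 (X11; Linux x86_64; rv:3.0) Gecko/2008 Firefox/3.0',
);

# Show the string that will be sent in the User-Agent request header.
print $ua->agent, "\n";
```

    If the server's response changes once the agent string is set, user-agent sniffing is a likely cause.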

    Another (far-fetched, remote) possibility is that the form is generated dynamically by javascript. Since your LWP will not normally execute javascript - you see no form.

    To see the differences in interaction between a regular browser and your LWP script, you will probably need a network sniffer, like Wireshark.
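
    Short of a sniffer, LWP can print its own side of the conversation. The sketch below assumes a reasonably recent LWP::UserAgent that supports `add_handler`; the URL in the comment is a hypothetical placeholder.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# Dump each outgoing request and each completed response.
# The bare "return" lets LWP carry on processing as usual.
$ua->add_handler(request_send  => sub { shift->dump; return });
$ua->add_handler(response_done => sub { shift->dump; return });

# Hypothetical URL; substitute the page under investigation.
# my $res = $ua->get('https://example.com/cgi-bin/iquery.pl');
```

    Comparing the dumped request headers with what the browser sends (as seen in Wireshark or a browser's own tools) shows which headers or cookies differ.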

         Have you been high today? I see the nuns are gay! My brother yelled to me...I love you inside Ed - Benny Lava, by Buffalax

Re: get with LWP drops HTML
by smiffy (Pilgrim) on Sep 30, 2008 at 05:03 UTC

    I can't test anything here as you haven't actually given us a full URI which you are trying to retrieve. However, here are some things to consider:

    • The URI you are trying to call in the first example looks like it might be expecting form data to be POSTed to it. If you try to GET a page that is expecting form data through POST, you are likely to run into problems.
    • There is a possibility that content may be generated by scripts. When you save a page from your browser, you may well be saving content that has been rendered by scripts, not just what gets delivered by your HTTP request.
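
    To illustrate the first point, here is a rough sketch of building a multipart POST with HTTP::Request::Common, matching the `enctype="multipart/form-data"` seen in the browser-saved form. The URL, the assumption that `case_num` is a form field, and the case number are all hypothetical placeholders; the real form must be inspected for its actual fields.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Hypothetical placeholders: URL, field name, and field value.
my $url = 'https://example.com/cgi-bin/iquery.pl';

# Content_Type => 'form-data' asks for a multipart/form-data body.
my $req = POST(
    $url,
    Content_Type => 'form-data',
    Content      => [ case_num => '08-12345' ],
);

# Inspect the request before sending it anywhere.
print $req->method, ' ', $req->uri, "\n";
print $req->header('Content-Type'), "\n";

# To send it: my $res = LWP::UserAgent->new->request($req);
```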

    I'm not saying that either of these is the problem, but they should be borne in mind when trying to retrieve pages using simpler user agents such as LWP. Before you start writing your own user agents in Perl or any other language, you really need to know what's going on with the target site. Forms, scripts, cookies and the like that are handled quietly by graphical browsers may need to be addressed in your code.

    The best way to get a better picture of what is going on is to use a different browser, something like Lynx. Text-only browsers are far closer to user agents that you might create using LWP than graphical ones like Firefox, Opera, IE, etcetera. At the very least, I would suggest that you test this all out in a browser with JavaScript disabled. Lynx, links, wget and friends are, however, the tools that I'd recommend to get to the bottom of this.

    Hope this helps.

      Thanks all, for all of the info.

      I've tried using wget for some sample pages and the code-dropping does not seem to be present there. It would still be nice to use Perl to go through all of the forms I need, but for now I am trying to hack together a solution that uses the incomplete data from LWP to generate wget commands.

      It would still be nice to know for future reference what the hell is going on though!

        It would still be nice to use Perl to go through all of the forms I need

        Please do yourself a favour then and look at good monk petdance's WWW::Mechanize. It will save you countless hours of work ;)
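
        A rough sketch of what that could look like. The helper name `query_case`, the assumption that the form of interest is the first one on the page, and the field name `case_num` are all hypothetical.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Hypothetical helper: fetch a page, fill in its first form, submit it,
# and return the content of the resulting page.
sub query_case {
    my ($url, $case_num) = @_;

    # autocheck makes failed requests die instead of failing silently.
    my $mech = WWW::Mechanize->new(autocheck => 1);
    $mech->get($url);

    # Assumption: form number 1 has a field named "case_num".
    $mech->submit_form(
        form_number => 1,
        fields      => { case_num => $case_num },
    );

    return $mech->content;
}
```

        WWW::Mechanize is a subclass of LWP::UserAgent and keeps cookies between requests by default, so a login step followed by form submissions can share one `$mech` object.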

        --
        b10m
Re: get with LWP drops HTML
by ikegami (Patriarch) on Sep 30, 2008 at 04:51 UTC

    Does "get" reformat the code?

    Not at all. ("Save as" from the browser does, at least in some browsers.)

      "Save as" from the browser does, at least in some browsers.
      Even "View Page source" in Firefox does it, if I remember correctly, which I consider wrong behaviour.
Re: get with LWP drops HTML
by jialanw (Initiate) on Oct 07, 2008 at 22:36 UTC
    I think I finally solved it - by running the script on a server instead of my desktop. I have no idea why I am having the dropped code problem in the first place though ...
