shu has asked for the wisdom of the Perl Monks concerning the following question:

Hi all... I have written the following code to save the HTML content of a page to an external text file. For some reason the page can't be accessed, though there is no problem accessing it via a browser. The code works fine with other pages. I was wondering if anyone could tell me why. If it is some cookie/PHP/JSP problem, then how do I solve it? The code is:
    use HTML::TokeParser;
    use HTML::TokeParser::Simple;
    require LWP::UserAgent;
    use HTML::Parser;
    use Data:umper;
    $ua = LWP::UserAgent->new;
    my $file="html_content.txt";
    my $output_file="parsed_data.txt";
    open(FH,">staff.txt");
    my $url="http://www.accountancy.smu.edu.sg/facultystaff/faculty.htm";
    $ARGV[0]=$url;
    print "Parsing $url\n\n";
    $request = HTTP::Request->new('GET', $url);
    $response = $ua->request($request);
    if ($response->is_success) {
        print "response successful!\n\n";
        print FH $response->content();
    } else {
        print "bad luck! unsuccessful request\n\n";
    }
    close (FH);
and the output is always giving me

"bad luck! unsuccessful request"
for the page:
http://www.accountancy.smu.edu.sg/f...aff/faculty.htm

I think it's a cookie problem, but I'm not sure how I can set the cookies and then make the request for a page...

Please help

Thanks in advance

Replies are listed 'Best First'.
Re: Help with LWP::UserAgent
by dws (Chancellor) on Feb 06, 2004 at 07:40 UTC
    I think it's a cookie problem ...

    When in doubt, talk to the web server directly:

        dws% telnet www.accountancy.smu.edu.sg 80
        Trying 202.161.41.246...
        Connected to iisinternetnlbs.smu.edu.sg.
        Escape character is '^]'.
        GET /facultystaff/faculty.htm HTTP/1.0

        HTTP/1.1 403 Forbidden
        Content-Length: 1409
        Content-Type: text/html
        Server: Microsoft-IIS/6.0
        X-Powered-By: ASP.NET
        Date: Fri, 06 Feb 2004 07:35:37 GMT
        Connection: close
        ...
        <h1>The page must be viewed over a secure channel</h1>

    This was after successfully pulling up the page in IE 5.5. I suspect that the server is being rude to non-IE browsers. Try setting the user agent string to something that looks like what IE might issue. Consult the LWP pod for details.
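
    The same check can be made from inside the script by printing the status line instead of a generic failure message. What follows is a minimal sketch, not a tested fix: describe_response is a hypothetical helper name, and the response is built by hand to mimic the telnet session above so that the example runs without touching the network.

    ```perl
    use strict;
    use warnings;
    use HTTP::Response;

    # Hypothetical helper: summarise a response as its status line,
    # plus the Server header when the server sent one.
    sub describe_response {
        my ($response) = @_;
        my $summary = $response->status_line;
        my $server  = $response->header('Server');
        $summary .= " (server: $server)" if defined $server;
        return $summary;
    }

    # Hand-built response that mimics the telnet session (no network access).
    my $fake = HTTP::Response->new(403, 'Forbidden');
    $fake->header(Server => 'Microsoft-IIS/6.0');
    print describe_response($fake), "\n";
    ```

    In the original script, passing $response to such a helper in the else branch would show the 403 rather than just "bad luck!". The body of that 403 says the page must be viewed over a secure channel, so it may also be worth trying the https:// form of the URL; that is only a guess from the message text.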

      If so, replace the line
          $ua = LWP::UserAgent->new;
      with
          $ua = LWP::UserAgent->new(
              agent => "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
          );
Re: Help with LWP::UserAgent
by Corion (Patriarch) on Feb 06, 2004 at 07:36 UTC

    The hard way of approaching this problem is to install a logging proxy (written with, for example, HTTP::Proxy) or a network sniffer (for example Ethereal, or something using Net::Pcap) between your browser and the server, see what gets sent between the two, and replay that from the script.

    This is hard, because you will see a lot of unrelated traffic and will have to set things up first.
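
    A lighter-weight complement, which is not the proxy or sniffer approach itself, is to have LWP print what the script sends and receives so that it can be compared with the browser's traffic. The sketch below uses LWP::UserAgent's add_handler with the request_send and response_done phases; taking the URL from the command line is an assumption made here so that nothing touches the network unless asked to.

    ```perl
    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;

    # Print each outgoing request and each completed response.
    # Returning an empty list lets LWP carry on as normal.
    $ua->add_handler(request_send  => sub { shift->dump; return });
    $ua->add_handler(response_done => sub { shift->dump; return });

    # Only touch the network when a URL is given on the command line.
    if (@ARGV) {
        my $response = $ua->get($ARGV[0]);
        print "Final status: ", $response->status_line, "\n";
    }
    ```

    This shows only the script's side of the conversation; seeing what the browser sends still needs the proxy or sniffer.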

    The easy way is to use WWW::Mechanize, which tries relatively hard to emulate a browser. It handles cookies already for you and it has easy ways of masquerading as a browser as well.

    If a script using WWW::Mechanize does not work, then you will have to fall back on the above solutions.

    I can't test it from here, but I think that the following WWW::Mechanize script recreates what your script does:

        use strict;
        use WWW::Mechanize;

        my $agent = WWW::Mechanize->new();
        my $url = 'http://www.accountancy.smu.edu.sg/facultystaff/faculty.htm';
        $agent->get($url);
        # status() returns the HTTP status code of the last response
        print "Got return code ", $agent->status, "\n";

        open FH, ">", "staff.txt" or die "Can't open staff.txt for writing: $!\n";
        print FH $agent->content;
        close FH;
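
    The masquerading mentioned above could look roughly like the following. This is a sketch under assumptions rather than a verified fix for this site: it relies on WWW::Mechanize's built-in 'Windows IE 6' agent alias and its default in-memory cookie jar, turns autocheck off so that a failed request can be inspected instead of dying, and takes the URL from the command line (an assumption made here so the example does nothing on the network by default).

    ```perl
    use strict;
    use warnings;
    use WWW::Mechanize;

    # autocheck => 0: a failed request does not die, we inspect it ourselves.
    my $mech = WWW::Mechanize->new(autocheck => 0);

    # Send a User-Agent header that looks like Internet Explorer 6.
    $mech->agent_alias('Windows IE 6');
    print "User-Agent: ", $mech->agent, "\n";

    # Only fetch when a URL is given on the command line.
    if (@ARGV) {
        $mech->get($ARGV[0]);
        if ($mech->success) {
            open my $fh, '>', 'staff.txt'
                or die "Can't open staff.txt for writing: $!\n";
            binmode $fh;
            print {$fh} $mech->content;
            close $fh;
        }
        else {
            print "Request failed: ", $mech->res->status_line, "\n";
        }
    }
    ```

    Printing the status line on failure makes it easier to tell a refused request from, say, a proxy or DNS problem.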
Re: Help with LWP::UserAgent
by matthewb (Curate) on Feb 06, 2004 at 07:38 UTC

    I guess the `use Data::umper;' bit is a typo, in which case I can see nothing immediately wrong with your code. If you are at work, is it possible that your web browser has been configured to use a proxy for internet access?

    If that is the case, you may find this node useful.

    MB
Re: Help with LWP::UserAgent
by pelagic (Priest) on Feb 06, 2004 at 08:06 UTC
    Hi!
    I found some code in my code-stash:
        #!/usr/bin/perl
        use LWP::Simple;
        $doc = get 'http://www.accountancy.smu.edu.sg/facultystaff/faculty.htm';
        open (SAVE,">test.txt") || die "Can't Open test.txt for writing: $!\n";
        binmode (SAVE);
        print SAVE $doc;
        close (SAVE);

    I can save the html file with that ...
    pelagic

    I can resist anything but temptation.
Re: Help with LWP::UserAgent
by kalamiti (Friar) on Feb 06, 2004 at 09:32 UTC
    I don't understand
        $ARGV[0]=$url;

    Is that another typo? A hack?
Re: Help with LWP::UserAgent
by shu (Initiate) on Feb 06, 2004 at 09:05 UTC
    Yeah, Data:umper was a typo, but that wasn't the problem.

    Hmm, I tried almost all the suggestions here but none of them seemed to work :( OK, what if I need to access a page which requires authorisation instead? That is, I need to set the cookies for a username and password and add them to the request header before accessing the page. How can I do this via a Perl script?
    Any suggestions?

      Read the documentation for LWP, LWP::Simple and WWW::Mechanize (and maybe even WWW::Mechanize::Shell). These do what you need, including authentication. WWW::Mechanize::Shell will try its best and write the Perl script for you, but maybe you're better off writing the script using WWW::Mechanize.
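
      As a rough starting point, here is a minimal sketch with plain LWP. It assumes the site uses HTTP Basic authentication; a site that logs you in through an HTML form needs the form submitted instead, which is where WWW::Mechanize helps. The URL, username and password are made-up placeholders, and the request is only printed, not sent.

      ```perl
      use strict;
      use warnings;
      use LWP::UserAgent;
      use HTTP::Request;
      use HTTP::Cookies;

      # Placeholder values, for illustration only.
      my $url  = 'http://www.example.com/protected/page.html';
      my $user = 'someuser';
      my $pass = 'secret';

      my $ua = LWP::UserAgent->new;

      # Keep cookies in memory; LWP stores any the server sets and
      # sends them back on later requests made with this $ua.
      $ua->cookie_jar(HTTP::Cookies->new);

      # Build the request and add an "Authorization: Basic ..." header.
      my $request = HTTP::Request->new(GET => $url);
      $request->authorization_basic($user, $pass);

      # Show what would be sent, without contacting any server.
      print $request->as_string;

      # To send it for real: my $response = $ua->request($request);
      ```

      With the cookie jar in place, cookies do not normally have to be added to the request header by hand.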