Heidegger has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I'm trying to retrieve a web page with LWP::UserAgent. With most websites (e.g. http://www.bokstas.lt), it returns the HTML string of the page. However, when I try to retrieve one particular website (http://www.finasta.lt), I get an empty string and no error. The page I'm trying to retrieve begins with a long run of whitespace. Has anyone encountered a problem when retrieving such content? My retrieve function is below:

sub retrieve_url() {
    my $ua = LWP::UserAgent->new;
    my $response = $ua->get(shift);
    if ($response->is_success) {
        return $response->content;
    }
    else {
        die $response->status_line;
    }
}

Any help will be appreciated.

Replies are listed 'Best First'.
Re: Retrieving HTML with LWP::UserAgent
by saskaqueer (Friar) on May 26, 2004 at 10:40 UTC

    All I can say is that it works fine for me. That sure is a lot of preceding whitespace, but I get everything nonetheless. Maybe try stripping the whitespace? Other than that, try giving us some more information if you can.

    print retrieve_url( 'http://www.finasta.lt' );

    # note - don't use the parentheses after the sub name;
    # we're not needing prototypes here
    sub retrieve_url {
        my $url = shift;
        my $ua  = LWP::UserAgent->new();
        my $res = $ua->get( $url );
        if ( $res->is_success() ) {
            my $content = $res->content();
            $content =~ s!\A\s+!!;
            return( $content );
        }
        else {
            die( "retrieval error: ", $res->status_line() );
        }
    }
Re: Retrieving HTML with LWP::UserAgent
by Somni (Friar) on May 26, 2004 at 10:45 UTC

    I'm not able to replicate your problem here. The URL you said was returning no content is working just fine; it is a little slow, however. Perhaps the problem is in the calling code; how is it dealing with the content? How have you determined it has no content?

    In addition, the parens on your subroutine declaration are wrong: sub retrieve_url () {...} declares retrieve_url as taking no arguments, because the parens there are a prototype (see perlsub). You should be using sub retrieve_url { ... }. You probably haven't seen any errors from this because the call appears before the declaration, so the prototype can't be checked at that point; but you should at least be seeing warnings about it (you are using warnings, right?).

      This reply is very impressive.

      Just to extend on this a little: there are two ways we might call this sub, either before it is defined or after it is defined. Let's look at both of them:

      • Call before it is defined.

        use LWP::UserAgent;
        use strict;
        use warnings;

        print retrieve_url( 'http://www.finasta.lt' );

        sub retrieve_url() {
            my $url = shift;
            my $ua  = LWP::UserAgent->new();
            my $res = $ua->get( $url );
            if ( $res->is_success() ) {
                my $content = $res->content();
                $content =~ s!\A\s+!!;
                return( $content );
            }
            else {
                die( "retrieval error: ", $res->status_line() );
            }
        }

        In this case, you would be warned that:

        main::retrieve_url() called too early to check prototype at a.pl line 6.

        However, Perl will close one eye and let the code run "successfully".

      • Call after the sub is defined. If you want to get rid of that annoying warning, you have to define the sub first; but then Perl stops the program from running entirely, which to me is the ideal behaviour: the mistake is caught at compile time instead of being waved through.

        use LWP::UserAgent;
        use strict;
        use warnings;

        sub retrieve_url() {
            my $url = shift;
            my $ua  = LWP::UserAgent->new();
            my $res = $ua->get( $url );
            if ( $res->is_success() ) {
                my $content = $res->content();
                $content =~ s!\A\s+!!;
                return( $content );
            }
            else {
                die( "retrieval error: ", $res->status_line() );
            }
        }

        print retrieve_url( 'http://www.finasta.lt' );

        Try it, and you get:

        Too many arguments for main::retrieve_url at a.pl line 23, near "'http://www.finasta.lt' )"
        Execution of a.pl aborted due to compilation errors.
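
      As an aside, if you ever want to check what prototype Perl has recorded for a sub, the built-in prototype() function will tell you. A small sketch (the sub names here are just for illustration):

        use strict;
        use warnings;

        sub no_args ()  { return 42 }   # empty prototype: declared to take no arguments
        sub list_args   { return @_ }   # no prototype at all

        # prototype() returns the prototype string ("" here), or undef if there is none
        my $p = prototype( \&no_args );
        print defined $p ? "prototype: '$p'\n" : "no prototype\n";   # prototype: ''

        $p = prototype( \&list_args );
        print defined $p ? "prototype: '$p'\n" : "no prototype\n";   # no prototype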
Re: Retrieving HTML with LWP::UserAgent
by markjugg (Curate) on May 26, 2004 at 14:49 UTC
    I have a couple of suggestions which don't directly address your question, but may help:

    Try WWW::Mechanize instead. It's based on LWP::UserAgent, but has a friendlier interface; see the sketch below. Also, make sure you have the most recent version of LWP::UserAgent. An outdated version could explain why others have gotten different results.
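
    A minimal sketch of the Mechanize version, assuming a reasonably recent WWW::Mechanize is installed (with autocheck enabled, get() dies on any HTTP error, so there's no need to test is_success by hand):

    use strict;
    use warnings;
    use WWW::Mechanize;

    # autocheck => 1 makes every request die on failure
    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get( 'http://www.finasta.lt' );
    print $mech->content();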

Re: Retrieving HTML with LWP::UserAgent
by linuxfro (Novice) on May 26, 2004 at 23:58 UTC
    I have tried your code and it works fine. There is lots of whitespace, but other than that it works fine.
    #!/usr/bin/perl -w
    use strict;
    use LWP::UserAgent;

    sub retrieve_url {
        my $ua = LWP::UserAgent->new;
        my $response = $ua->get(shift);
        if ($response->is_success) {
            return $response->content;
        }
        else {
            die $response->status_line;
        }
    }

    my $var = retrieve_url('http://www.finasta.lt');
    print $var;

    Maybe somewhere in your script you are prematurely clobbering the variable you are putting the content into.
Re: Retrieving HTML with LWP::UserAgent
by crenz (Priest) on May 26, 2004 at 22:45 UTC

    While not directly related to your problem, sometimes people do block access to their pages for robots. Maybe the page is blocked for "LWP-UserAgent/$yourversion"?
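
    A quick way to test that theory is to override the default agent string (normally of the form "libwww-perl/x.xx") with LWP::UserAgent's agent() method and see whether the response changes. A minimal sketch; the browser-ish string here is just an example:

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;
    # Pretend to be something other than the default libwww-perl agent
    $ua->agent('Mozilla/5.0 (compatible; TestFetcher/0.1)');

    my $res = $ua->get('http://www.finasta.lt');
    die $res->status_line unless $res->is_success;
    print $res->content;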