cdherold has asked for the wisdom of the Perl Monks concerning the following question:

I've been using LWP::Simple successfully in a number of programs to fetch web pages with
    $url = "http://www.whatever.com";
    $body = get("$url");
    print "$body";
Pretty simple ... but all of a sudden I've come upon a URL for which this does not work. I've cross-tested the code with URLs that work and then put in this new site URL (http://wire.ap.org/APnews/center_minor.html?FRONTID=SCIENCE), but all it prints out is one big blank.

Has anyone seen this before? Is it possible that there is some security system on this page that will not allow it to be retrieved?

cdherold

Replies are listed 'Best First'.
(crazyinsomniac) Re: LWP::SIMPLE fails on certain URL
by crazyinsomniac (Prior) on Jan 27, 2002 at 15:51 UTC
    In situations like these, you must use LWP::Debug
    $>perl -MLWP::Debug=+ -MLWP::Simple -we"print get('http://wire.ap.org/APnews/center_minor.html?FRONTID=SCIENCE')"
    LWP::UserAgent::new: ()
    LWP::UserAgent::request: ()
    LWP::UserAgent::request: Simple response: Bad Request
    Use of uninitialized value in print at -e line 1.
    A simple LWP::Simple::get($url) didn't work for me either, even though I could see the page in my browser, but LWP::Simple::getstore did, so I debugged that too:
    $>perl -MLWP::Debug=+ -MLWP::Simple -we"print getstore('http://wire.ap.org/APnews/center_minor.html?FRONTID=SCIENCE','file.txt')"
    LWP::UserAgent::new: ()
    LWP::UserAgent::request: ()
    LWP::UserAgent::simple_request: GET http://wire.ap.org/APnews/center_minor.html?FRONTID=SCIENCE
    LWP::UserAgent::_need_proxy: Not proxied
    LWP::Protocol::http::request: ()
    LWP::Protocol::http::request: GET /APnews/center_minor.html?FRONTID=SCIENCE HTTP/1.0
    Host: wire.ap.org
    User-Agent: LWP::Simple/5.51
    LWP::Protocol::http::request: reading response
    LWP::Protocol::http::request: HTTP/1.1 302 Found
    Date: Sun, 27 Jan 2002 10:36:34 GMT
    Server: Apache/1.3.12 (Unix) mod_perl/1.23
    Location: /public_pages/WirePortal.pcgi
    Connection: close
    Content-Type: text/html; charset=iso-8859-1
    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <HTML><HEAD>
    <TITLE>302 Found</TITLE>
    </HEAD><BODY>
    <H1>Found</H1>
    The document has moved <A HREF="/public_pages/WirePortal.pcgi">here</A>.<P>
    <HR>
    <ADDRESS>Apache/1.3.12 Server at wire.ap.org Port 80</ADDRESS>
    </BODY></HTML>
    LWP::Protocol::http::request: HTTP/1.1 302 Found
    LWP::Protocol::collect: read 277 bytes
    LWP::UserAgent::request: Simple response: Found
    LWP::UserAgent::request: ()
    LWP::UserAgent::simple_request: GET http://wire.ap.org/public_pages/WirePortal.pcgi
    LWP::UserAgent::_need_proxy: Not proxied
    LWP::Protocol::http::request: ()
    LWP::Protocol::http::request: GET /public_pages/WirePortal.pcgi HTTP/1.0
    Host: wire.ap.org
    User-Agent: LWP::Simple/5.51
    LWP::Protocol::http::request: reading response
    LWP::Protocol::http::request: HTTP/1.1 200 OK
    Connection: close
    Date: Sun, 27 Jan 2002 08:31:02 GMT
    Server: Apache/1.3.12 (Unix) mod_perl/1.23
    Content-MD5: MA6SFpj03gSVT0/n/+ZEnA
    Content-Type: text/html; charset=ISO-8859-1
    Title: Testing JavaScript
    <!-- pcache/1.7.3 Cache loaded from file on warrant Sun Jan 27 03:31:06 2002 -->
    <!-- $Id: WirePortal.pcgi,v 1.10 2001/12/13 21:04:53 jxu Exp $ -->
    <HTML>
    <HEAD>
    <META HTTP-EQUIV=Refresh CONTENT="1; URL=/public_pages/WirePortal.pcgi/nojs.html">
    <TITLE>Testing JavaScript</TITLE>
    <script language="JavaScript">
    <!--
    function gogo() {
        self.location.href='/public_pages/WirePortal.pcgi/us_portal.html'
    }
    // -->
    </script>
    <BODY BGCOLOR="#FFFFFF" onload="gogo()">
    </BODY>
    </HTML>
    LWP::Protocol::http::request: HTTP/1.1 200 OK
    LWP::Protocol::collect: read 478 bytes
    LWP::UserAgent::request: Simple response: OK
    200
    I'd definitely say this is a bug in LWP::UserAgent, and a ++ to anyone who takes the time to figure out where it is.

    Below is sub LWP::UserAgent::request, which is where LWP::Simple::get seems to fail. But first, here is the debug line in sub request that gives us the interesting error:

    LWP::Debug::debug('Simple response: ' . (HTTP::Status::status_message($code) || "Unknown code $code"));
    =item $ua->request($request, $arg [, $size])

    Process a request, including redirects and security. This method may
    actually send several different simple requests.

    The arguments are the same as for C<simple_request()>.

    =cut

    sub request
    {
        my($self, $request, $arg, $size, $previous) = @_;

        LWP::Debug::trace('()');

        my $response = $self->simple_request($request, $arg, $size);

        my $code = $response->code;
        $response->previous($previous) if defined $previous;

        LWP::Debug::debug('Simple response: ' .
                          (HTTP::Status::status_message($code) ||
                           "Unknown code $code"));

        if ($code == &HTTP::Status::RC_MOVED_PERMANENTLY or
            $code == &HTTP::Status::RC_MOVED_TEMPORARILY) {

            # Make a copy of the request and initialize it with the new URI
            my $referral = $request->clone;

            # And then we update the URL based on the Location:-header.
            my $referral_uri = $response->header('Location');
            {
                # Some servers erroneously return a relative URL for redirects,
                # so make it absolute if it not already is.
                local $URI::ABS_ALLOW_RELATIVE_SCHEME = 1;
                my $base = $response->base;
                $referral_uri = $HTTP::URI_CLASS->new($referral_uri, $base)
                                ->abs($base);
            }

            $referral->url($referral_uri);

            return $response unless $self->redirect_ok($referral);

            # Check for loop in the redirects
            my $count = 0;
            my $r = $response;
            while ($r) {
                if (++$count > 13 ||
                    $r->request->url->as_string eq $referral_uri->as_string) {
                    $response->header("Client-Warning" =>
                                      "Redirect loop detected");
                    return $response;
                }
                $r = $r->previous;
            }

            return $self->request($referral, $arg, $size, $response);

        } elsif ($code == &HTTP::Status::RC_UNAUTHORIZED ||
                 $code == &HTTP::Status::RC_PROXY_AUTHENTICATION_REQUIRED
                )
        {
            my $proxy = ($code == &HTTP::Status::RC_PROXY_AUTHENTICATION_REQUIRED);
            my $ch_header = $proxy ? "Proxy-Authenticate" : "WWW-Authenticate";
            my @challenge = $response->header($ch_header);
            unless (@challenge) {
                $response->header("Client-Warning" =>
                                  "Missing Authenticate header");
                return $response;
            }

            require HTTP::Headers::Util;
            CHALLENGE: for my $challenge (@challenge) {
                $challenge =~ tr/,/;/;  # "," is used to separate auth-params!!
                ($challenge) = HTTP::Headers::Util::split_header_words($challenge);
                my $scheme = lc(shift(@$challenge));
                shift(@$challenge); # no value
                $challenge = { @$challenge };  # make rest into a hash
                for (keys %$challenge) {       # make sure all keys are lower case
                    $challenge->{lc $_} = delete $challenge->{$_};
                }

                unless ($scheme =~ /^([a-z]+(?:-[a-z]+)*)$/) {
                    $response->header("Client-Warning" =>
                                      "Bad authentication scheme '$scheme'");
                    return $response;
                }
                $scheme = $1;  # untainted now
                my $class = "LWP::Authen::\u$scheme";
                $class =~ s/-/_/g;

                no strict 'refs';
                unless (%{"$class\::"}) {
                    # try to load it
                    eval "require $class";
                    if ($@) {
                        if ($@ =~ /^Can\'t locate/) {
                            $response->header("Client-Warning" =>
                                              "Unsupported authentication scheme '$scheme'");
                        } else {
                            $response->header("Client-Warning" => $@);
                        }
                        next CHALLENGE;
                    }
                }
                return $class->authenticate($self, $proxy, $challenge, $response,
                                            $request, $arg, $size);
            }
            return $response;
        }
        return $response;
    }
    update: I have $LWP::Simple::VERSION = 1.33; and $LWP::VERSION = 5.51;

     
    ______crazyinsomniac_____________________________
    Of all the things I've lost, I miss my mind the most.
    perl -e "$q=$_;map({chr unpack qq;H*;,$_}split(q;;,q*H*));print;$q/$q;"

      It is a known bug. (see post below.) I think it is a socket library problem. Removing the timeout in _trivial_get fixes it. Also, prepending a space will force it to not use the _trivial_get which will also get around it.

      I'll ++everything the person who fixes this writes. I spent over a month trying to figure it out. Drove me #!@#! nuts!

      -Lee

      "To be civilized is to deny one's nature."
Re: LWP::SIMPLE fails on certain URL
by grep (Monsignor) on Jan 27, 2002 at 14:02 UTC
    Have you checked the content of the result? I get:
    <!-- pcache/1.7.3 Cache loaded from file on warrant Sun Jan 27 03:31:06 2002 -->
    <!-- $Id: WirePortal.pcgi,v 1.10 2001/12/13 21:04:53 jxu Exp $ -->
    <HTML>
    <HEAD>
    <META HTTP-EQUIV=Refresh CONTENT="1; URL=/public_pages/WirePortal.pcgi/nojs.html">
    <TITLE>Testing JavaScript</TITLE>
    <script language="JavaScript">
    <!--
    function gogo() {
        self.location.href='/public_pages/WirePortal.pcgi/us_portal.html'
    }
    // -->
    </script>
    <BODY BGCOLOR="#FFFFFF" onload="gogo()">
    </BODY>
    </HTML>
    This looks like a redirect to me. I would suggest looking at the page that it redirects to (preferably the non-JavaScript one) or asking the administrators of the site where you should be looking.
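    To see where such a page wants to send you, the META refresh target can be pulled out of the fetched HTML. The following is only a rough sketch: the sub name meta_refresh_target is made up for illustration, and a regex like this is good enough for a quick diagnostic but is no substitute for a real HTML parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the target of the first <META HTTP-EQUIV=Refresh ...> tag in $html,
# or undef (in scalar context) if there is none.
# meta_refresh_target is a hypothetical helper, not part of LWP.
sub meta_refresh_target {
    my ($html) = @_;
    return unless defined $html;
    if ($html =~ /<meta\s+http-equiv\s*=\s*["']?refresh["']?\s+content\s*=\s*["']\s*\d+\s*;\s*url\s*=\s*([^"']+?)\s*["']/is) {
        return $1;
    }
    return;
}

# Sample input copied from the page fetched above.
my $page = <<'HTML';
<HTML> <HEAD>
<META HTTP-EQUIV=Refresh CONTENT="1; URL=/public_pages/WirePortal.pcgi/nojs.html">
<TITLE>Testing JavaScript</TITLE>
HTML

my $target = meta_refresh_target($page);
print defined $target ? "Refresh target: $target\n" : "No meta refresh found\n";
```

    The target here is relative, so it still has to be resolved against the URL of the page it came from before it can be fetched.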

    grep
    grep> cd pub
    grep> more beer
      Hmmm ... I did check the content of the result and nothing was there. I ran the program in the browser, viewed the source, and zilch. Regarding what you managed to pull off the site ... it's interesting that the portion retrieved is only a fraction of what can be viewed if the URL is entered in the "Address:" box of the browser and the source code viewed. I'm still unclear on what may be happening here.
        Here is the code I used:
        #!/usr/bin/perl -w
        use strict;
        use LWP::Simple;
        my $url = "http://wire.ap.org/APnews/center_minor.html?FRONTID=SCIENCE";
        my $body = get("$url");
        print "$body";

        If this helps $LWP::Simple::VERSION = 1.35

        Update: I ran it a couple more times and I continue to get content.

        grep
        grep> cd pub
        grep> more beer
Re: LWP::Simple fails on certain URL
by particle (Vicar) on Jan 27, 2002 at 19:08 UTC
    following crazyinsomniac:

    my script:

    #!/usr/bin/perl -w
    use strict;
    $|=1;
    use LWP::Simple;
    use LWP::UserAgent;
    print "\$LWP::Simple::VERSION is $LWP::Simple::VERSION\n";
    print "Content-type: text/html\n\n";
    my $url = "http://wire.ap.org/APnews/center_minor.html?FRONTID=SCIENCE";
    my $body = get("$url");
    print "$body";
    produces the following output:

    C:\WINDOWS\Desktop>perl test_lwpsimple.pl
    $LWP::Simple::VERSION is 1.33
    Content-type: text/html

    Use of uninitialized value in string at test_lwpsimple.pl line 16.

    C:\WINDOWS\Desktop>
    with the default distribution from ActiveState in build 631. I downloaded and installed libwww-perl-5.63 in an alternate directory and prepended it to @INC, yielding:

    C:\WINDOWS\Desktop>perl -Ic:\perllib\libwww-perl-5.63\lib test_lwpsimple.pl
    $LWP::Simple::VERSION is 1.35
    Content-type: text/html

    <!-- pcache/1.7.3 Cache loaded from file on warrant Sun Jan 27 07:34:09 2002 -->
    <!-- $Id: WirePortal.pcgi,v 1.10 2001/12/13 21:04:53 jxu Exp $ -->
    <HTML>
    <HEAD>
    <META HTTP-EQUIV=Refresh CONTENT="1; URL=/public_pages/WirePortal.pcgi/nojs.html">
    <TITLE>Testing JavaScript</TITLE>
    <script language="JavaScript">
    <!--
    function gogo() {
        self.location.href='/public_pages/WirePortal.pcgi/us_portal.html'
    }
    // -->
    </script>
    <BODY BGCOLOR="#FFFFFF" onload="gogo()">
    </BODY>
    </HTML>

    C:\WINDOWS\Desktop>
    so it looks like an upgrade will do you good.

    ~Particle

Re: LWP::SIMPLE fails on certain URL
by Zaxo (Archbishop) on Jan 27, 2002 at 14:06 UTC
    if (defined $body) {
        print "$body";
    }
    else {
        # try it with LWP::UserAgent to see headers
    }
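    One way the else branch might be filled in is sketched below, under some assumptions: fetch_report is a made-up helper name, the 30-second timeout is arbitrary, and the URL is taken from the command line. The point is only that LWP::UserAgent hands back a response object with a status line and headers, where LWP::Simple::get collapses every failure into undef.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# Fetch $url and return a short text report: status line, response headers,
# and (on success) the body length. fetch_report is a hypothetical helper.
sub fetch_report {
    my ($url) = @_;
    my $ua = LWP::UserAgent->new;
    $ua->timeout(30);    # arbitrary value chosen for this sketch

    my $response = $ua->request(HTTP::Request->new(GET => $url));

    my $report = "Status: " . $response->status_line . "\n";
    $report .= $response->headers->as_string;
    $report .= "Body length: " . length($response->content) . "\n"
        if $response->is_success;
    return $report;
}

# Only go out to the network when a URL is given on the command line.
print fetch_report($ARGV[0]) if @ARGV;
```

    Run it as, say, perl fetch_report.pl followed by the URL that misbehaves; a redirect, a 4xx code, or an odd header shows up in the report instead of a silent blank.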

    After Compline,
    Zaxo

Re: LWP::SIMPLE fails on certain URL
by dws (Chancellor) on Jan 27, 2002 at 14:46 UTC
    Is it possible that there is some security system on this page that will not allow it to be retrieved?

    Yes. It's also possible that the site is responding to the User-Agent: header, though that seems unlikely since others using LWP have been able to fetch this page. You might give it a try, though.

    To set User-Agent: yourself, you'll need to use LWP::UserAgent instead of LWP::Simple.
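    A minimal sketch of that, with assumptions stated up front: make_agent is a made-up helper name and the browser string is just an example, not something the site is known to require.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# Build a user agent that sends the given User-Agent: header.
# make_agent is a hypothetical helper name.
sub make_agent {
    my ($agent_string) = @_;
    my $ua = LWP::UserAgent->new;
    $ua->agent($agent_string);
    return $ua;
}

# Example browser string; purely an assumption for illustration.
my $ua = make_agent('Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)');

# Only fetch when a URL is given on the command line.
if (@ARGV) {
    my $response = $ua->request(HTTP::Request->new(GET => $ARGV[0]));
    if ($response->is_success) {
        print $response->content;
    }
    else {
        print "Failed: ", $response->status_line, "\n";
    }
}
```

    Comparing the result with and without the custom string shows whether the server treats the two differently.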

Re: LWP::SIMPLE fails on certain URL
by shotgunefx (Parson) on Jan 28, 2002 at 03:03 UTC
    I ran into similar issues and can repeat your results. Are you on a SPARC? I posted a bug report on this on sourceforge a year ago. This started happening for me over a year ago after an upgrade. I've upgraded several times but no luck. Currently use v1.34

    Here's a workaround. Prepend a space to the URL. It won't use _trivial_get that way. (Don't ask me why I thought to try this, Zen I guess.)
    # This don't work for me.
    perl -MLWP::Simple -e 'my $a=get("http://wire.ap.org/APnews/center_minor.html?FRONTID=SCIENCE"); print $a;'

    # This does
    perl -MLWP::Simple -e 'my $a=get(" http://wire.ap.org/APnews/center_minor.html?FRONTID=SCIENCE"); print $a;'
    This drove me nuts for quite some time. Never was able to understand why it happens. Only with get, not getstore or getprint.

    -Lee

    "To be civilized is to deny one's nature."
Re: LWP::SIMPLE fails on certain URL
by screamingeagle (Curate) on Jan 27, 2002 at 14:20 UTC
    I just ran your code and got a complete HTML page back in response. It's possible that the target web site was having problems of its own when you tried your script ... and since your code does no error checking, network issues on the target side might have been the problem.
      Criminy ... I wonder why this still isn't working for me. I just tried again with the same result of nothing fetched at all.

      #!/usr/bin/perl
      use LWP::Simple;
      use LWP::UserAgent;
      use DBI;
      use CGI::Carp qw/ fatalsToBrowser /;
      print "Content-type: text/html\n\n";
      $url = "http://wire.ap.org/APnews/center_minor.html?FRONTID=SCIENCE";
      $body = get("$url");
      print "$body";

      Is there something I'm missing here? I wouldn't think so, because when I do this for other pages they all come out fine. Hmm ... still a little confused.

        Try changing your content type to 'text/plain' or escape the HTML with something like HTML::Entities. I have a feeling that $body contains the correct stuff, it's just getting misinterpreted somewhere between the variable and the 'view source' window of your browser...
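        A rough sketch of the escaping approach, assuming HTML::Entities is installed; as_visible_html is a made-up helper name and the sample string is lifted from the page quoted earlier in the thread. Escaped this way, the fetched markup is displayed as text rather than executed, so the JavaScript redirect cannot whisk the browser away.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Wrap a fetched body so the browser displays its markup instead of
# interpreting it. as_visible_html is a hypothetical helper name.
sub as_visible_html {
    my ($body) = @_;
    return '<p>(nothing fetched)</p>' unless defined $body && length $body;
    return '<pre>' . encode_entities($body) . '</pre>';
}

# Sample taken from the page quoted in this thread.
my $sample = '<BODY BGCOLOR="#FFFFFF" onload="gogo()"> </BODY>';

print "Content-type: text/html\n\n";
print as_visible_html($sample), "\n";
```

        In the CGI script above, the last line would become print as_visible_html($body); an undefined $body then shows up as an explicit message instead of a blank page.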

        -Blake

It's not your fault... Or LWP's fault.
by joealba (Hermit) on Jan 28, 2002 at 08:47 UTC
    One thing you should know: The Web site wire.ap.org looks at the referring url from every http request it gets before giving you the page. That's because wire.ap.org frames the content with the referrer's top navigation. If it does not find a referrer (or that referrer is not in AP's list of partner sites), it tosses you to the "select a state" page.

    This referrer check is most likely causing the troubles you're having here. If you find a way to get past that, the user agent check, cookies, javascript test, and the funny frames will also make things interesting.
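    For anyone who wants to test the referrer theory, a request with an explicit Referer: header can be built by hand. This is only a sketch: request_with_referer is a made-up helper name, both URLs come from the command line, and nothing here says which referrer values the site actually accepts.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# Build a GET request that carries a Referer: header.
# request_with_referer is a hypothetical helper name.
sub request_with_referer {
    my ($url, $referer) = @_;
    my $request = HTTP::Request->new(GET => $url);
    $request->header(Referer => $referer);
    return $request;
}

# Usage: perl referer_test.pl <url> <referring-url>
if (@ARGV == 2) {
    my ($url, $referer) = @ARGV;
    my $ua = LWP::UserAgent->new;
    my $response = $ua->request(request_with_referer($url, $referer));
    print $response->status_line, "\n";
    print $response->content if $response->is_success;
}
```

    If the body changes depending on the referring URL supplied, the referrer check is confirmed as at least part of the story.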

    For an example of what happens when wire.ap.org doesn't see a happy shiny user agent, use Netscape and turn off Javascript support -- then go to that url. It gives you nothing.

    For an example of the framing, go to projo.com and click on one of the stories in the "Top Stories from the AP" box. See the lovely frame at the top? Oooohh.. Aaaahhh..

    From what I gathered after speaking with the webmaster, there's all kinds of funky stuff going on with this site to keep people from scrubbing it for news.