http://qs1969.pair.com?node_id=147433

Popcorn Dave has asked for the wisdom of the Perl Monks concerning the following question:

Does anyone have any ideas on how to fool LWP::Simple's get function into grabbing a page that has been redirected? I'm working on a Perl program at present that pulls the web code and parses it, but it's getting stuck on a site that redirects. Short of using a regex on the code coming in, is there another method? TIA

Replies are listed 'Best First'.
Re: LWP::Simple and redirects
by merlyn (Sage) on Feb 25, 2002 at 23:58 UTC
    It normally works fine for me for sites that redirect, unless you've done something unusual. Can you give us the URL so we can trace down what's actually happening?

    (And on second reading...) Why would you be looking at the content of what's coming back to locate a redirect? A redirect is in the headers, not the content, and LWP::Simple doesn't give you access to the headers.

    -- Randal L. Schwartz, Perl hacker

      Actually, you can get some headers from LWP::Simple:
      head($url)
          Get document headers. Returns the following 5 values if successful:
          ($content_type, $document_length, $modified_time, $expires, $server)
          Returns an empty list if it fails. In scalar context returns TRUE if successful.
      although you can't get any headers with the get method. merlyn is right though; you should use LWP::UserAgent instead.
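
      A minimal sketch of what the LWP::UserAgent route might look like, assuming a stock libwww-perl install. The helper name fetch_with_trace and the example URL are made up for illustration; the point is only that the response object exposes the followed redirects and the headers, which get() from LWP::Simple does not.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper (not part of LWP): fetch $url, print each redirect
# that was followed, and return the final HTTP::Response object.
sub fetch_with_trace {
    my ($url) = @_;

    # max_redirect => 7 matches LWP's default cap; it is spelled out here
    # only to show where the limit lives.
    my $ua  = LWP::UserAgent->new(max_redirect => 7, timeout => 30);
    my $res = $ua->get($url);

    # redirects() lists the intermediate 30x responses, oldest first.
    for my $hop ($res->redirects) {
        printf "%s -> %s\n",
            $hop->status_line,
            $hop->header('Location') // '(no Location header)';
    }

    # The request attached to the final response carries the final URL.
    printf "final: %s (%s)\n", $res->request->uri, $res->status_line;
    return $res;
}

# Example call (needs network access; the URL is a placeholder):
# my $res = fetch_with_trace('http://www.example.com/');
# print $res->header('Content-Type'), "\n" if $res->is_success;
```

      Note that this only covers HTTP-level redirects (30x plus a Location header). A redirect done in the page itself will not show up in redirects().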

      BlueLines

      Disclaimer: This post may contain inaccurate information, be habit forming, cause atomic warfare between peaceful countries, speed up male pattern baldness, interfere with your cable reception, exile you from certain third world countries, ruin your marriage, and generally spoil your day. No batteries included, no strings attached, your mileage may vary.
Re: LWP::Simple and redirects
by shotgunefx (Parson) on Feb 26, 2002 at 09:46 UTC
    LWP::Simple does indeed do redirects. (Not JavaScript redirects though!)

    use LWP::Debug qw(+conns); # This will tell you what's going on.

    One bug that has come up in LWP::Simple is that on some machines it just mysteriously fails. Try prepending a space to the URL:
    get " http://yahoo.com";
    This stops it from using _trivial_http_get internally, which is the source of the bug.
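
    If LWP::Debug is not an option (it is marked deprecated in libwww-perl 6), a rough equivalent can be sketched with LWP::UserAgent handlers. This is an illustration only: tracing_agent is a made-up helper name, and it assumes you are willing to use LWP::UserAgent rather than LWP::Simple.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: build a user agent that logs every request it
# sends and every response it finishes, including redirect hops.
sub tracing_agent {
    my $ua = LWP::UserAgent->new;

    # request_send fires just before a request goes out on the wire.
    $ua->add_handler(request_send => sub {
        my ($request) = @_;
        print STDERR '> ', $request->method, ' ', $request->uri, "\n";
        return;    # returning nothing lets LWP send the request as usual
    });

    # response_done fires once a response has been received completely.
    $ua->add_handler(response_done => sub {
        my ($response) = @_;
        print STDERR '< ', $response->status_line, "\n";
        return;
    });

    return $ua;
}

# Example call (needs network access):
# tracing_agent()->get('http://yahoo.com/');
```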

    -Lee

    "To be civilized is to deny one's nature."
Re: LWP::Simple and redirects
by arhuman (Vicar) on Feb 26, 2002 at 08:05 UTC
    Are you GETting the page or POSTing?
    If you use POST your problem may come from strict RFC compliance.
    From memory:
    • RFC : If you get a 30x on a POST you shouldn't follow the redirection
    • BUT some (most) browsers follow the redirection in such case but GET the redirected page...
    Try to play with redirect_ok() ...

    "Only Bad Coders Code Badly In Perl" (OBC2BIP)
      Yes. I have written a subclass of LWP::UserAgent that replicates Netscape's behavior (i.e. it will follow POST redirects, but it will turn them into GETs before doing so). A lot of web applications rely on this non-standard behavior in browsers, making it a de facto standard, so I think it would be a good idea to integrate such a class into LWP. I have just posted the code for it here.

      By the way, overriding redirect_ok() to make UserAgent follow POST redirects is not enough, because redirects won't be converted into GETs, as browsers do. More on the subject at Redirect after POST behavior in LWP::UserAgent differs from Netscape's.
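
      For comparison, a sketch of a lighter approach than subclassing, assuming a libwww-perl version that provides requests_redirectable. This is not the subclass mentioned above, and browser_like_agent is a made-up name. As far as I recall, recent releases also rewrite the follow-up request to a GET on 302 and 303 responses, but treat that as an assumption and check it against your installed version.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: return a user agent that is allowed to follow
# redirects issued in response to a POST.
sub browser_like_agent {
    my $ua = LWP::UserAgent->new;

    # By default only GET and HEAD requests are redirectable.
    push @{ $ua->requests_redirectable }, 'POST';

    return $ua;
}

# Example call (needs network access; URL and form fields are placeholders):
# my $res = browser_like_agent()->post('http://www.example.com/login',
#                                      { user => 'me', pass => 'secret' });
# print $res->request->method, ' ', $res->request->uri, "\n";
```

      Printing the method of the final request, as in the commented call, is a quick way to see whether the POST really was turned into a GET on your version.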

Re: LWP::Simple and redirects
by smitz (Chaplain) on Feb 26, 2002 at 12:36 UTC
    OK, I admit: this is not an answer to your question, but if you want that kind of functionality, use

    LWP::UserAgent

    LWP::Simple just ain't clever enough, AFAIK.

    SMiTZ
Re: LWP::Simple and redirects
by Popcorn Dave (Abbot) on Feb 26, 2002 at 16:34 UTC
    The web site I was running into problems with was http://www.courier.co.uk. LWP::Simple does grab the page's code for me as I'm using get(<www addy>) and it does have <meta http-equiv="Refresh" in the source. I will look into LWP::UserAgent and see if that will work more efficiently. Thanks to all!
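
    A meta Refresh is not an HTTP redirect, so no LWP client will follow it on its own. One way to pick it up without hand-rolling a regex over the whole page is to let HTML::HeadParser turn the http-equiv tag into a header. The sketch below is only an illustration: refresh_target is a made-up helper, and the sample markup in the usage comment is invented, not copied from the site.

```perl
use strict;
use warnings;
use HTML::HeadParser;
use URI;

# Hypothetical helper: return the absolute URL named by a
# <meta http-equiv="Refresh"> tag in $html, or undef if there is none.
# $base is the URL the page was fetched from.
sub refresh_target {
    my ($html, $base) = @_;

    # HeadParser exposes <meta http-equiv="X" content="..."> as header X.
    my $parser = HTML::HeadParser->new;
    $parser->parse($html);
    my $refresh = $parser->header('Refresh');
    return undef unless defined $refresh;

    # Typical value: "0; URL=/index.jsp" (delay, then the target).
    my ($target) =
        $refresh =~ /^\s*[\d.]+\s*[;,]\s*(?:url\s*=\s*)?["']?([^"'\s]+)/i
        or return undef;

    # Resolve relative targets against the page's own URL.
    return URI->new_abs($target, $base)->as_string;
}

# Example (invented markup):
# my $html = '<html><head><meta http-equiv="Refresh" '
#          . 'content="0; URL=/index.jsp"></head><body></body></html>';
# my $next = refresh_target($html, 'http://www.courier.co.uk/');
```

    With LWP::UserAgent the same value may already be available as $res->header('Refresh'), since it parses the head of HTML responses by default; I would verify that on the installed version before relying on it.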
      It does have a redirect.
      document.location.href = "/index.jsp"; 
      
      I would normally suggest just changing the URL to http://www.courier.co.uk/index.jsp, but when I do this it works but there is no content. Maybe it's agent sensitive; try using LWP::UserAgent instead and set the agent string to match IE.
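
      Setting the agent string is a one-liner with LWP::UserAgent. A small sketch follows; ie_like_agent is a made-up name and the string is just one plausible IE-style value, not something the site is known to require.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: return a user agent that identifies itself
# with an IE-style User-Agent header instead of "libwww-perl/x.y".
sub ie_like_agent {
    my $ua = LWP::UserAgent->new;
    $ua->agent('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)');
    return $ua;
}

# Example call (needs network access):
# my $res = ie_like_agent()->get('http://www.courier.co.uk/index.jsp');
# print length($res->decoded_content // ''), " bytes\n";
```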

      -Lee

      "To be civilized is to deny one's nature."