gecko has asked for the wisdom of the Perl Monks concerning the following question:

Howdy, I have code which should recognize a redirect when crawling with WWW::Robot. The gist is this: WWW::Robot has a number of hooks along the way which let you edit/change/check the content. The one right after the GET request is 'invoke-after-get'. Here is my hook for testing purposes:
    'invoke-after-get' => sub {
        my ($robot, $hook, $url, $response) = @_;
        if (DEBUG) {
            print "ORIG_URL: $url\n";
            # Dump the names of all response headers, one per line
            print "URL: ";
            for ($response->header_field_names) {
                print "$_\n";
            }
            print "\n";
            # Status code as seen by the hook
            print "RESPONSE: " . $response->code . "\n";
            print "\n";
        }
    },
All fine and dandy when you're dealing with normal stuff. The problem is that I need to check whether the response is a redirect (a 301 or a 302). WWW::Robot never reports a 301 or a 302, always a 200 (Success), even on redirected pages. For example, I have a local page which redirects to Google:
    C:\Documents and Settings\gecko\Desktop>nc localhost 80
    GET /cgi-bin/redirect.pl HTTP/1.1
    host:localhost

    HTTP/1.1 302 Moved
    Date: Sun, 17 Jun 2007 02:34:36 GMT
    Server: Apache/2.2.4 (Win32)
    Location: http://www.google.com
    Content-Length: 0
    Content-Type: text/plain
Yet when I do it with WWW::Robot:
    ORIG_URL: http://127.0.0.1/cgi-bin/redirect.pl
    URL: Cache-Control
    Date
    Server
    Content-Type
    Client-Date
    Client-Peer
    Client-Response-Num
    Client-Transfer-Encoding
    Set-Cookie
    Title

    RESPONSE: 200

    URL: http://www.google.com/
    ORIG_URL: http://127.0.0.1/cgi-bin/redirect.pl
    RESPONSE: 200
    SIZE: 5799
    TITLE: Google
As you can see, the robot's output doesn't even have a "Location" field, so it seems to be following the redirect automatically, even though the documentation describes the hook like this:
    invoke-after-get
        This hook function is invoked immediately after the robot makes each GET request. This means your hook function will see every type of response, not just successful GETs.
How do you recommend I detect a 301/302 in this case? Thanks, monks!

Replies are listed 'Best First'.
Re: WWW::Robots problem
by merlyn (Sage) on Jun 17, 2007 at 03:18 UTC
    Since LWP::RobotUA @ISA LWP::UserAgent, you're getting the default behavior where a 301/302 is handled "internally" for you. You'll need to instantiate your own LWP::RobotUA so that it doesn't follow the 30x, and then use the USERAGENT attribute of WWW::Robot to use your instance instead.

    Something like this might work:

    my $www_bot = WWW::Robot->new(
        ...
        USERAGENT => LWP::RobotUA->new(requests_redirectable => []),
        ...
    );
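
Once the user agent stops following redirects, the hook should see the 30x response itself. As a rough sketch of the check the hook could then make, the helper below (redirect_info is a hypothetical name, not part of WWW::Robot or LWP) looks at an HTTP::Response in two ways: first whether the response is itself a redirect, and second whether an already-followed redirect left a trail reachable through the response's previous/redirects chain. The responses here are built by hand to stand in for what the robot would pass to 'invoke-after-get'; whether WWW::Robot hands the hook the response with that chain intact is an assumption worth verifying.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: returns (status code, Location value) for the first
# redirect associated with a response, or an empty list if there is none.
sub redirect_info {
    my ($response) = @_;

    # Case 1: the user agent did not follow the redirect,
    # so the response itself carries the 3xx status.
    if ($response->is_redirect) {
        return ($response->code, scalar $response->header('Location'));
    }

    # Case 2: the user agent followed the redirect; earlier responses
    # are chained via ->previous, and ->redirects lists them oldest first.
    my @chain = $response->redirects;
    if (@chain) {
        my $first = $chain[0];
        return ($first->code, scalar $first->header('Location'));
    }

    return;
}

# Hand-built stand-ins for what the hook might receive.
my $moved = HTTP::Response->new(302, 'Moved');
$moved->header(Location => 'http://www.google.com');

my $final = HTTP::Response->new(200, 'OK');
$final->previous($moved);    # simulate a redirect that was followed

for my $response ($moved, $final) {
    if (my ($code, $location) = redirect_info($response)) {
        print "redirect: $code -> $location\n";
    }
}
```

Inside the hook, calling the same helper on $response would let you branch on a 301/302 before doing anything else with the page.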