checking for valid http links

swkronenfeld has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

Problem description: My company has some revision-controlled documents that are stored under an increasing revision-letter scheme (beginning with a dash), e.g.

2222_-.doc
2222_A.doc
2222_B.doc
2222_C.doc
etc.

These documents are linked from the web, but they don't want to have to go in and change the HTML every time a document revision changes. So I wrote a CGI script which would get the path to the base file (i.e. 2222_.doc) as an input, and determine the correct revision and redirect the user. I originally wrote this using Net::FTP, and it was working, but then the web server crashed. The sysadmin had some problems with the Apache .netrc file (which was storing the login/password to the documents server), and he doesn't have the time to fix this.

So I have to rewrite without FTP access. I went to CPAN and looked up HTTP::Request, and went from there. My code is working at the moment, but the document gets downloaded twice before the user can view it: once on the server to make sure the link is valid, and again when the user downloads it. Some of these files are rather large, so this seems like a waste of time. But even sending the user the direct output from the $ua->request() call won't save much time.

End result: I'm looking for a way to see if a link is valid without downloading the content at that link.

I found this script on CPAN which will print just the returned headers, but it still downloads the whole page before printing it. This is leading me to believe that this may not be possible? I guess this question is more of an HTTP question than a strictly Perl one, but since there are so many modules out there, I thought someone could shove me in the right direction if I'm missing something.

Here is my code:

#!/sw/local/bin/perl -Tw

use strict;
use HTTP::Request;
use LWP::UserAgent;

print "Content-type: text/html\n\n";
print "<html>\n";

my $path = $ENV{'QUERY_STRING'};
if(!$path) { dienice("Must pass in at least 1 argument.") }

my $file;
if($path =~ s:/([^/]+)$::) { $file = $1 }
else { dienice("Incorrectly formatted path : $path") }

my $ext; #file extension
if($file =~ s/\.(.+)$//) { $ext = $1 }
else { dienice("Incorrectly formatted filename: $file") }

my $ua = LWP::UserAgent->new;

my $tmpfile;
foreach my $rev ("-", ('A'..'Z')) {
  $tmpfile = "$file$rev.$ext";
  my $request = HTTP::Request->new(GET => "$path/$tmpfile");
  my $response = $ua->request($request);
  print "$response->{_msg}<br>";
  last if($response->{_msg} eq "OK");
}

print "<head><meta http-equiv=Refresh content=\"10; URL=$path/$tmpfile\"></head><body>";
print "<a href=\"$path/$tmpfile\">Please click here if you are not automatically redirected</a><br>\n";
print "<p>Due to security measures, you will only be able to access files in /QUALITY/DocConSys in an account that has access to this folder.<br>";
print "</body></html>";
exit;

sub dienice {
print "<body><h2>Error:</h2>$_[0]</body></html>";
exit;
}

btw, any other comments on my code or method are welcome. Thanks for your help


Replies are listed 'Best First'.
•Re: checking for valid http links by merlyn (Sage) on Aug 14, 2003 at 19:53 UTC
If you just want to verify that the file is there, try a HEAD request instead of a GET request. If that gives you a false negative (it might, depending on the server), then set the `max_size` attribute of your `LWP::UserAgent` object to something like 512 bytes, so LWP will abort very quickly:

my $ua = LWP::UserAgent->new (max_size => 512); # note extra param
my $tmpfile;
foreach my $rev ("-", ('A'..'Z')) {
  $tmpfile = "$file$rev.$ext";
  my $response = $ua->get("$path/$tmpfile"); # simpler interface
  ## you need to save $tmpfile here!
  last if $response->is_success; # better than yours
}

-- Randal L. Schwartz, Perl hacker
Be sure to read my standard disclaimer if this is a reply.
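A minimal sketch of the HEAD-first, capped-GET-second approach described above, wrapped around the same revision loop. `link_exists` and `find_revision` are hypothetical helper names and the host in the usage line is made up; `head`, `get`, `max_size` and `is_success` are standard `LWP::UserAgent` / `HTTP::Response` features. It has not been run against the original poster's server.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: true if $url answers a HEAD request successfully,
# or, failing that, a GET that LWP cuts short because of max_size.
sub link_exists {
    my ($ua, $url) = @_;

    # HEAD asks for the headers only, so no document body is transferred.
    return 1 if $ua->head($url)->is_success;

    # Some servers mishandle HEAD; retry with GET. The max_size set on
    # $ua makes LWP abort the transfer once the body grows past that size.
    return $ua->get($url)->is_success ? 1 : 0;
}

# Hypothetical helper: walk the revision letters in the same order as the
# loop in the thread and return the first URL that exists, or undef.
sub find_revision {
    my ($ua, $base_url, $file, $ext) = @_;

    foreach my $rev ('-', 'A' .. 'Z') {
        my $url = "$base_url/$file$rev.$ext";
        return $url if link_exists($ua, $url);
    }
    return;    # no revision found
}

# Usage (hypothetical host):
#   perl check.pl http://docs.example.com/QUALITY 2222_ doc
if (@ARGV == 3) {
    my $ua  = LWP::UserAgent->new(max_size => 512, timeout => 10);
    my $url = find_revision($ua, @ARGV);
    print defined $url ? "Found: $url\n" : "No revision found\n";
}
```

Returning `undef` when nothing matches lets the caller print an error page instead of redirecting to a file that does not exist.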
Re: checking for valid http links by hardburn (Abbot) on Aug 14, 2003 at 19:50 UTC
The HTTP/1.1 specification requires all general-purpose web servers to implement two methods: `GET` and `HEAD`. If you've done much CGI programming, you already know what `GET` does. `HEAD` gives you all the headers that would be sent in a `GET`, but without sending the actual data. At least, it will as long as your web server was coded properly (I've seen weird cases where a server does send the data in a `HEAD`, but I don't remember how it happened). So, you just need to change from a `GET` request to a `HEAD`.

----
I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
-- Schemer

Note: All code is untested, unless otherwise stated
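To see the headers-without-body behaviour described above, here is a small sketch that makes the one-word change to the original `HTTP::Request` call and reports what came back. `head_info` is a hypothetical helper name, and the URL is whatever you pass on the command line. Whether a `Content-Length` header is present depends on the server.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Request;
use LWP::UserAgent;

# Hypothetical helper: send a HEAD request and summarise the response.
sub head_info {
    my ($ua, $url) = @_;

    # Same call as in the original script, with HEAD in place of GET.
    my $response = $ua->request(HTTP::Request->new(HEAD => $url));

    return {
        ok         => $response->is_success ? 1 : 0,
        status     => $response->status_line,
        # Size the server says a GET would return (may be undef).
        size       => scalar $response->header('Content-Length'),
        # Bytes of body actually received; a well-behaved server sends none.
        body_bytes => length $response->content,
    };
}

# Usage: perl head.pl <url>
if (@ARGV) {
    my $info = head_info(LWP::UserAgent->new(timeout => 10), $ARGV[0]);
    printf "%s: Content-Length=%s, body bytes received=%d\n",
        $info->{status},
        defined $info->{size} ? $info->{size} : 'unknown',
        $info->{body_bytes};
}
```

Checking `is_success` rather than comparing the message text to "OK" also covers servers that word their 2xx responses differently.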
Re: checking for valid http links by revdiablo (Prior) on Aug 14, 2003 at 20:47 UTC
I see your question has already been answered thoroughly by the other fine monks, but I thought I might point out something that wasn't in your question. I notice you're doing a fair amount of directly printing HTML. You've probably heard this a thousand times, but this kind of thing should really be avoided. HTML::Template is my templating tool of choice, and I suggest you check it out.

If you don't want the inconvenience of keeping a separate template file (though some may argue embedding the template directly in the Perl code is even less convenient), you can very easily put the text of the template in `__DATA__` at the end of the script. Here's a quick (and very untested) snippet showing how one might do such a thing:

#!/usr/bin/perl

use strict;
use warnings;
use HTML::Template;

my $var1 = 'foo';
my $var2 = 'bar';

my $template = HTML::Template->new(filehandle => *DATA);
$template->param(var1 => $var1);
$template->param(var2 => $var2);

print "Content-Type: text/html\n\n", $template->output;

__DATA__
<html>
<head>
<title>Test HTML::Template!</title>
</head>
<body>
<p>Var1 is set to <!-- TMPL_VAR NAME="var1" --><br/>
Var2 is set to <!-- TMPL_VAR NAME="var2" --></p>
</body>
</html>
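Applied to the redirect page from the original question, the same idea might look like the sketch below. `redirect_page` is a hypothetical helper name and the URL in the usage line is made up; the template lives in a string passed via `scalarref` only to keep the example short, and `ESCAPE=HTML` is used so that characters such as `&` in the URL are encoded in the output.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Template;

# Template text kept in a string for brevity; a __DATA__ section read
# with filehandle => *DATA would work the same way.
my $page = <<'END_HTML';
<html>
<head>
<meta http-equiv="Refresh" content="10; URL=<TMPL_VAR ESCAPE=HTML NAME="url">">
</head>
<body>
<a href="<TMPL_VAR ESCAPE=HTML NAME="url">">Please click here if you are not automatically redirected</a>
</body>
</html>
END_HTML

# Hypothetical helper: fill the template with the resolved document URL.
sub redirect_page {
    my ($url) = @_;
    my $template = HTML::Template->new(scalarref => \$page);
    $template->param(url => $url);
    return $template->output;
}

# Usage (hypothetical URL):
#   perl redirect.pl 'http://docs.example.com/QUALITY/2222_B.doc'
if (@ARGV) {
    print "Content-Type: text/html\n\n", redirect_page($ARGV[0]);
}
```

Keeping the markup in one place means the backslash-escaped quotes from the original `print` statements are no longer needed.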
Re: checking for valid http links by swkronenfeld (Hermit) on Aug 14, 2003 at 19:59 UTC
doh! I spent so much time looking for such a simple fix. Thanks for your quick responses!

