PerlMonks  

checking for valid http links

by swkronenfeld (Hermit)
on Aug 14, 2003 at 19:27 UTC ( #283977=perlquestion )

swkronenfeld has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

Problem description: My company has some revision-controlled documents that are stored under an increasing letter scheme (beginning with a dash), e.g.

2222_-.doc 2222_A.doc 2222_B.doc 2222_C.doc etc.

These documents are linked from the web, but they don't want to have to go in and change the HTML every time a document revision changes. So I wrote a CGI script which takes the path to the base file (e.g. 2222_.doc) as input, determines the correct revision, and redirects the user. I originally wrote this using Net::FTP, and it was working, but then the web server crashed. The sysadmin had some problems with the Apache .netrc file (which was storing the login/password to the documents server), and he doesn't have the time to fix this.

So I have to rewrite without FTP access. I went to CPAN and looked up HTTP::Request, and went from there. My code is working at the moment, but it has to download the document twice in order for the user to view it. Once on the server to make sure it's valid, and again for the user to download it. Some of these files are rather large, so this seems like a waste of time. But even sending the user the direct output from the $ua->request() call won't save much time.

End result: I'm looking for a way to see if a link is valid without downloading the content at that link.

I found this script on CPAN which will print just the returned headers, but it still downloads the whole page before printing it. This is leading me to believe that this may not be possible? I guess this question is more of an HTTP question than a strictly Perl one, but since there are so many modules out there, I thought someone could shove me in the right direction if I'm missing something.

Here is my code:

    #!/sw/local/bin/perl -Tw
    use strict;
    use HTTP::Request;
    use LWP::UserAgent;

    print "Content-type: text/html\n\n";
    print "<html>\n";

    my $path = $ENV{'QUERY_STRING'};
    if (!$path) { dienice("Must pass in at least 1 argument.") }

    # Split the last path component off as the base file name.
    my $file;
    if ($path =~ s:/([^/]+)$::) { $file = $1 }
    else { dienice("Incorrectly formatted path : $path") }

    my $ext;    # file extension
    if ($file =~ s/\.(.+)$//) { $ext = $1 }
    else { dienice("Incorrectly formatted filename: $file") }

    my $ua = LWP::UserAgent->new;
    my $tmpfile;
    # Try each revision suffix in order and stop at the first one that exists.
    foreach my $rev ("-", ('A'..'Z')) {
        $tmpfile = "$file$rev.$ext";
        my $request  = HTTP::Request->new(GET => "$path/$tmpfile");
        my $response = $ua->request($request);
        print "$response->{_msg}<br>";
        last if ($response->{_msg} eq "OK");
    }

    print "<head><meta http-equiv=Refresh content=\"10; URL=$path/$tmpfile\"></head><body>";
    print "<a href=\"$path/$tmpfile\">Please click here if you are not automatically redirected</a><br>\n";
    print "<p>Due to security measures, you will only be able to access files in /QUALITY/DocConSys in an account that has access to this folder.<br>";
    print "</body></html>";
    exit;

    sub dienice {
        print "<body><h2>Error:</h2>$_[0]</body></html>";
        exit;
    }

btw, any other comments on my code or method are welcome. Thanks for your help

Replies are listed 'Best First'.

Re: checking for valid http links
by merlyn (Sage) on Aug 14, 2003 at 19:53 UTC
    If you just want to verify that the file is there, try a HEAD request instead of a GET request. If that gives you a false negative (it might, depending on the server), then set the max_size attribute of your LWP::UserAgent object to something like 512 bytes, so LWP will abort very quickly:

        my $ua = LWP::UserAgent->new (max_size => 512); # note extra param
        my $tmpfile;
        foreach my $rev ("-", ('A'..'Z')) {
            $tmpfile = "$file$rev.$ext";
            my $response = $ua->get("$path/$tmpfile"); # simpler interface
            ## you need to save $tmpfile here!
            last if $response->is_success; # better than yours
        }
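
A minimal sketch of the two-step check described in this reply: try HEAD first, and only if that fails fall back to a GET capped by max_size. The helper name `url_exists` and the way the user agent is passed in are assumptions made for illustration; `head`, `get`, `max_size` and `is_success` are the standard LWP calls.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: returns 1 if $url looks reachable, 0 otherwise.
# Step 1 asks for headers only; step 2 is the fallback for servers that
# mishandle HEAD, with the download capped at 512 bytes.
sub url_exists {
    my ($ua, $url) = @_;

    return 1 if $ua->head($url)->is_success;

    my $old_limit = $ua->max_size;      # remember the caller's setting
    $ua->max_size(512);                 # LWP stops reading after this many bytes
    my $ok = $ua->get($url)->is_success;
    $ua->max_size($old_limit);          # restore it

    return $ok ? 1 : 0;
}
```

It would be called as `url_exists(LWP::UserAgent->new, "$path/$tmpfile")` inside the revision loop. A truncated GET still carries the server's status code, so `is_success` remains a usable test.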

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

Re: checking for valid http links
by hardburn (Abbot) on Aug 14, 2003 at 19:50 UTC

    The HTTP/1.1 specification requires all web servers to implement two methods: GET and HEAD. If you've done much CGI programming, you already know what GET does. HEAD gives you all the headers that would be sent in a GET, but without sending the actual data. At least, it will as long as your web server was coded properly (I've seen weird cases where a server does send the data in a HEAD, but I don't remember how it happened).

    So, you just need to change from a GET request to a HEAD.
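
As a rough illustration of that change, the revision loop from the question could be written around HEAD requests as below. The sub names `candidates` and `first_existing` are made up for this sketch, and it keeps the original script's behaviour of stopping at the first revision that exists; `$ua->head` and `is_success` are the regular LWP::UserAgent and HTTP::Response methods.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: file names in revision order, "-" first, then "A" .. "Z".
sub candidates {
    my ($base, $ext) = @_;
    return map { $base . $_ . '.' . $ext } '-', 'A' .. 'Z';
}

# Hypothetical helper: first URL under $dir_url that answers a HEAD request
# with a success status, or undef if none of the names does.
sub first_existing {
    my ($ua, $dir_url, @names) = @_;
    for my $name (@names) {
        my $response = $ua->head("$dir_url/$name");   # headers only, no body
        return "$dir_url/$name" if $response->is_success;
    }
    return undef;
}
```

In the CGI script this would look something like `my $url = first_existing(LWP::UserAgent->new, $path, candidates($file, $ext))`, followed by a call to dienice if `$url` comes back undefined.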

    ----
    I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
    -- Schemer

    Note: All code is untested, unless otherwise stated

Re: checking for valid http links
by revdiablo (Prior) on Aug 14, 2003 at 20:47 UTC

    I see your question has already been answered thoroughly by the other fine monks, but I thought I might point out something that wasn't in your question. I notice you're doing a fair amount of directly printing HTML. You've probably heard this a thousand times, but this kind of thing should really be avoided. HTML::Template is my templating tool of choice, and I suggest you check it out. If you don't want the inconvenience of keeping a separate template file (though some may argue embedding the template directly in the Perl code is even less convenient), you can very easily put the text of the template in __DATA__ at the end of the script.

    Here's a quick (and very untested) snippet showing how one might do such a thing:

        #!/usr/bin/perl

        use strict;
        use warnings;

        use HTML::Template;

        my $var1 = 'foo';
        my $var2 = 'bar';

        my $template = HTML::Template->new(filehandle => *DATA);

        $template->param(var1 => $var1);
        $template->param(var2 => $var2);

        print "Content-Type: text/html\n\n", $template->output;

        __DATA__
        <html>
        <head>
        <title>Test HTML::Template!</title>
        </head>
        <body>
        <p>Var1 is set to <!-- TMPL_VAR NAME="var1" --><br/>
        Var2 is set to <!-- TMPL_VAR NAME="var2" --></p>
        </body>
        </html>

Re: checking for valid http links
by swkronenfeld (Hermit) on Aug 14, 2003 at 19:59 UTC
    doh! I spent so much time looking for such a simple fix.

    Thanks for your quick responses!
