Update: Upon request, a copy of this has been posted to Tutorials. Thanks all for the response!
Introduction
I have lately had reason to use LWP::Simple for lots of small tasks: downloading a PDF on the command line without wget (since my browser didn't get it right), fetching the Chatterbox XML ticker, and plenty of other odd jobs. None of these would have been quite as easy without LWP::Simple, although there are of course alternatives. But, as I'm sure you have heard, it is recommended to "Do the simplest thing that could possibly work", which in Perl programming often means using the package named XXX::Simple.
While doing this, I've "discovered" a few neat tricks that make its use even simpler, or more effective, and I thought I'd also share a few other things that might not be a given (like the HEAD part) for those not so familiar with HTTP and web servers.
I'm sure there is more that I am missing out here, but these things made life easier on me, at least. So here goes:
Read the documentation
Sounds like a given, but it is easy to neglect - or to think that one remembers everything. lwpcook, LWP::Simple, LWP::UserAgent and LWP are good places to look. Or just type perldoc name on your command line - you should have this utility bundled with your perl distribution.
This mini tutorial assumes that you have some basic knowledge of using LWP::Simple.
Export the UserAgent
A poorly documented feature of LWP::Simple is that it supports exporting the LWP::UserAgent object it uses to fetch with.
Why would you want to do that? Well, the default timeout for LWP::Simple is the same as for LWP::UserAgent, that is 180 seconds, or three minutes. This is often far too long. In one real-life example of mine, I had a small script running every minute, fetching something from the web. With such a timeout I could end up with several copies of the script running simultaneously, potentially accessing the same log files or something similar. There are other ways to work around this, of course, such as setting alarms or implementing file locking. But a long timeout made no sense either way, since if the page didn't respond within 30 seconds, it was probably down anyway.
This code will take care of this problem:
# Note that if you do this, you must explicitly
# import everything you want to use:
use LWP::Simple qw($ua get);
$ua->timeout(30);
# get returns undef on any failure, a timeout included
my $html = get($webpage) or die "Could not fetch $webpage (timed out?)";
Another thing you might want to do is change your reported useragent:
$ua->agent('My agent/1.0');
If you want to do several requests, of which the first should include a login, or something else stateful which uses cookies, you can even attach a cookie jar to use with LWP::Simple:
use LWP::Simple qw($ua get);
use HTTP::Cookies;
$ua->cookie_jar(HTTP::Cookies->new);
get($webpage . $login_string);
my $logged_in_page = get($webpage . $private_page);
And, as usual with cookie jars, you can of course specify a file to save the cookies in between invocations of the script.
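A minimal sketch of such a file-backed cookie jar. The file name cookies.txt is a hypothetical choice; file and autosave are constructor options documented in HTTP::Cookies.

```perl
use strict;
use warnings;
use LWP::Simple qw($ua get);
use HTTP::Cookies;

# 'cookies.txt' is a hypothetical file name; pick any writable path.
my $jar = HTTP::Cookies->new(
    file     => 'cookies.txt',
    autosave => 1,    # write the jar back to the file when it is destroyed
);
$ua->cookie_jar($jar);

# Any get() from here on sends and stores cookies through $jar.
```

Note that cookies the server marks as session-only are normally not written to the file; HTTP::Cookies has an ignore_discard option for that case, should you need it.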
As you can see, this opens up some possibilities for extra tweaking. But why not use LWP::UserAgent then, instead? Well, simply because this way is so much simpler if you only need those small extras. The corresponding LWP::UserAgent example for timeout looks like this:
use LWP::UserAgent;
my $ua = LWP::UserAgent->new;
$ua->timeout(30);
my $request = HTTP::Request->new('GET', $webpage);
my $response = $ua->request($request);
my $html = $response->content;
As you can see, lots more typing. See LWP::UserAgent for all possibilities you have on this.
Update: I added the next section after getting inspiration from arunhorne's node below. I think that there is a simpler way to do this, again in some cases. It is somewhat related to the previous section, since you might use the UserAgent for this.
Use environment variables to set proxies
As pointed out by arunhorne below, it is sometimes necessary to use a proxy because you are behind a firewall. As suggested, one can always use the exported UserAgent to cope with this, by setting $ua->proxy.
But upon init, LWP::Simple will also call $ua->env_proxy, as described in LWP::UserAgent. This means that if you use the same script somewhere else, or run several LWP::Simple scripts, it might be easier to simply set your environment variables, like http_proxy for all http requests. However, if the proxy requires credentials, I don't think that is possible to do via the environment, in which case you must resort to the UserAgent way of doing things.
This is an easy way to set your proxy, on that machine, for all eternity - without modifying the script. That may, or may not be what you want. :)
The docs on LWP::UserAgent mention how to set these on *NIX based platforms. I just want to add that on Windows, the command you want is set - try typing set /? to get some instructions. Or just set it the GUI way, which should be somewhere below the control panel.
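For comparison, here is a small sketch of the UserAgent way of setting a proxy. The host proxy.example.com:8080 and the domain example.internal are placeholders, not real servers; proxy and no_proxy are LWP::UserAgent methods.

```perl
use strict;
use warnings;
use LWP::Simple qw($ua get);

# proxy.example.com:8080 is a placeholder; substitute your own proxy.
$ua->proxy(['http', 'ftp'], 'http://proxy.example.com:8080/');

# Hosts in these domains are fetched directly, bypassing the proxy.
$ua->no_proxy('localhost', 'example.internal');

# Calling proxy with just a scheme reads the current setting back.
print 'http goes through: ', $ua->proxy('http'), "\n";
```

This keeps the setting inside the script, which is the opposite trade-off from the environment variable: nothing to configure on the machine, but the script has to be edited when the proxy changes.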
Use LWP::Simple on the command line
This is well documented in lwpcook, but it is worth mentioning. *NIX people usually have the excellent wget program to take care of this stuff. It is probably available somewhere for other platforms too, and it is included in cygwin (though not by default).
But if you know how to use LWP::Simple on the command line, and you have perl available (you do have perl on all your computers, right?) then you already know how to fetch files and pages on any platform. This is a very nice tool to have in one's toolbox.
You could even use the chatterbox from the command line, using either of these (depending on whether you are more fluent in XML or HTML) to read it:
perl -MLWP::Simple -e "getprint 'http://perlmonks.org?node_id=145587'"
perl -MLWP::Simple -e "getprint 'http://perlmonks.org?node=showchatmessages&displaytype=raw'"
...and something like this to post your own messages:
perl -MLWP::Simple -e "get 'http://perlmonks.org?op=login&user=Dog and Pony&passwd=doNotUseThisPW&op=message&message=Hi it is me on the command line!'"
Although, for your sanity's sake, I do not really recommend it... :)
Try using get to post data into forms
Many forms out there on the web don't really need a POST request to accept your data. One good example is the regular search box at the top of the perlmonks pages; it expects the field 'node' to contain some search words. But it doesn't care whether the request is a GET or a POST, even though the form itself uses a POST.
This code works just fine:
my $words = 'LWP::Simple tutorial';
my $html = get "http://www.perlmonks.org/index.pl?node=$words";
What it is really about is of course that the server can check whether it is really a POST that is coming its way or not. PerlMonks has wisely chosen not to do so, thus making it much simpler for people to use this ability - not to mention that arbitrary linking such as [some words] uses this to link as best as it can. Very useful for names in the chatterbox, for instance.
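The search words in the snippet above go into the URL as they are, spaces and colons included. If that ever causes trouble, one possible approach (a sketch, not something the original code needs) is to let the URI module, which LWP itself depends on, build and escape the query string:

```perl
use strict;
use warnings;
use URI;

my $words = 'LWP::Simple tutorial';

# Build the query string through URI so that spaces, colons, ampersands
# and the like are escaped instead of being pasted in raw.
my $uri = URI->new('http://www.perlmonks.org/index.pl');
$uri->query_form(node => $words);

print "$uri\n";

# With LWP::Simple loaded, the fetch would then be: my $html = get("$uri");
```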
The way to POST data described in lwpcook is not very hard or complex either, but this way still beats it.
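For reference, a sketch in the spirit of the lwpcook approach, using HTTP::Request::Common to build the same search as a real POST. Only the request is constructed here; the lines that would send it are left as comments.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common qw(POST);

my $words = 'LWP::Simple tutorial';

# Build a form-encoded POST request; nothing is sent at this point.
my $request = POST 'http://www.perlmonks.org/index.pl', [ node => $words ];

print $request->method, ' ', $request->uri, "\n";
print $request->content, "\n";

# Sending it takes a user agent:
#   my $response = LWP::UserAgent->new->request($request);
#   print $response->content if $response->is_success;
```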
Use the HTTP status codes when possible
LWP::Simple also exports the HTTP::Status constants and procedures, as documented. The author notes that this is a mistake and makes LWP::Simple slower, but while it is there, we should really take advantage of it for the functions that make it possible.
The functions in LWP::Simple that return an HTTP status code are getprint, getstore and mirror. This is for example the number '200' for a successful fetch, or '404' for 'Page not found', as documented in HTTP::Status. We can use these numbers to determine the success or failure of a fetch.
But it is simpler than that, unless we have special needs, as we also get the functions is_success and is_error exported. We can feed these numbers to them and get a quick answer as to whether everything is fine or not:
my $response_code = mirror $webpage, 'webpage.html';
# mirror returns 304 (RC_NOT_MODIFIED) when the local copy is already current
die "Bad response $response_code"
unless is_success($response_code) || $response_code == RC_NOT_MODIFIED;
Note: If you do the trick with exporting the UserAgent above, you will need to explicitly import these functions too.
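To get a feel for how these helpers classify codes, here is a small sketch that needs no network access. It relies only on the HTTP::Status functions that LWP::Simple exports by default; the choice of codes is arbitrary.

```perl
use strict;
use warnings;
use LWP::Simple;    # default import list includes the HTTP::Status helpers

# Classify a few codes without touching the network.
for my $code (200, 304, 404, 500) {
    my $verdict = is_success($code) ? 'success'
                : is_error($code)   ? 'error'
                :                     'neither';
    printf "%d %-22s %s\n", $code, status_message($code), $verdict;
}
```

A 304 counts as neither success nor error, which matters for mirror since it returns 304 when the local file is already up to date.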
Use head to determine if a site is up
This is somewhat covered in lwpcook, but it doesn't mention that this is much easier on the network traffic and the web server (if that is an issue). So if all you want to do is check if the server is responding, or if the document exists, without actually fetching it - use the function head:
use LWP::Simple;
print "$webpage exists and server is up!\n" if (head($webpage));
It is also worth noting that pinging the server will not tell you if the web server is up, so head is the way to go for this.
Of course, you also get some information in the form of a list from head if you want it: content type, document length, last modified time, expiry date and server name, in that order.
my @headers = head $webpage;
print join "\n", @headers;
This will print this data for the webpage of your choice.
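A bare list of values is a bit hard to read, so here is a sketch of a helper that labels them. The sub name label_head_info and the sample values are made up for illustration; only the order of the five fields comes from the list above.

```perl
use strict;
use warnings;

# Label the five values head() returns in list context, in the order
# given above. Missing headers come back undefined, hence the fallback.
sub label_head_info {
    my @values = @_;
    my @names  = ('Content-type', 'Document length', 'Last modified',
                  'Expires', 'Server');
    return map {
        sprintf '%-16s %s', "$names[$_]:",
            defined $values[$_] ? $values[$_] : '(not sent)'
    } 0 .. $#names;
}

# Made-up sample values; in real use: label_head_info(head($webpage))
print "$_\n" for label_head_info('text/html', 5120, 1021910400, undef, 'Apache');
```

As far as I know the two time fields come back as epoch seconds, so pass them through localtime if you want something human-readable.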
Drawbacks
Well, none that aren't advertised in the documentation, but there are some things that one may or may not like:
- LWP::Simple might seem limited. Well, it is, by design. Of course it would be nice to be able to do POSTs as easily, but I've noticed that I rarely actually need that, and there are still ways to do it when you do need it. LWP::Simple seems to cover most of the basic cases you stumble upon.
- LWP::Simple pollutes the name space. Indeed it does, and that tends to be something I don't really like. If I see a subroutine call 'get', how do I know if it is mine or someone else's? This can be a problem when using someone else's code, or your own old code. You can "solve" this by documenting the call with a comment, or by always calling your own subs with a leading '&'. LWP::Simple tends (for me) to show up in small scripts and one-liners, so then it isn't very hard to see what is going on, and it makes things much easier. It also allows you to easily use LWP::Simple on the command line.
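If the name space issue does bite in a larger script, one option (a sketch of standard Perl import behaviour, not something specific to LWP::Simple) is to load the module with an empty import list and call its functions by their full names:

```perl
use strict;
use warnings;
use LWP::Simple ();    # empty import list: nothing lands in our namespace

# A local sub called get() no longer clashes with the LWP::Simple one.
sub get { return "my own get(@_)" }

print get('something'), "\n";

# The LWP::Simple version is still reachable by its full name:
#   my $html = LWP::Simple::get('http://www.perlmonks.org/');
```

The price is more typing, which is exactly what LWP::Simple tries to save you, so this mostly makes sense outside of one-liners.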
Final words
As you can see, there is lots and lots to gain by using LWP::Simple, and by using it right. Simple doesn't always have to mean (too) limited. I hope this has been a help in your web programming and/or automation tasks - sometimes, simple is all it takes.
You have moved into a dark place.
It is pitch black. You are likely to be eaten by a grue.
Replies are listed 'Best First'.
LWP::Simple UserAgent and Fire-walls
by arunhorne (Pilgrim) on May 20, 2002 at 17:47 UTC
(wil) Re: Getting more out of LWP::Simple
by wil (Priest) on May 20, 2002 at 17:40 UTC
Re: Getting more out of LWP::Simple
by ignatz (Vicar) on May 21, 2002 at 20:27 UTC
by Dog and Pony (Priest) on May 23, 2002 at 07:28 UTC
by Anonymous Monk on Mar 23, 2003 at 20:55 UTC
by Aristotle (Chancellor) on Mar 23, 2003 at 22:44 UTC
by Anonymous Monk on Mar 24, 2003 at 12:14 UTC
by jasonk (Parson) on Mar 23, 2003 at 21:18 UTC
by Aristotle (Chancellor) on Mar 23, 2003 at 23:02 UTC