PerlMonks  

Update: Upon request, a copy of this has been posted to Tutorials. Thanks all for the response!

Introduction

I have lately had reason to use LWP::Simple for lots of small tasks: downloading a PDF on the command line without wget (since my browser didn't get it right), fetching the Chatterbox XML ticker, and more. None of these would have been quite as easy without LWP::Simple, although there are of course alternatives. But, as I'm sure you have heard, it is recommended to "Do the simplest thing that could possibly work", which in Perl programming often means using the package named XXX::Simple.

While doing this, I've "discovered" a few neat tricks that make its use even simpler, or more effective, and I thought I'd also share a few other things that might not be a given (like the HEAD part) for those not so familiar with HTTP and web servers.

I'm sure there is more that I am missing out here, but these things made life easier on me, at least. So here goes:

Read the documentation

Sounds like a given, but it is easy to neglect - or to think that one remembers everything. lwpcook, LWP::Simple, LWP::UserAgent and LWP are good places to look. Or just type perldoc name on your command line - you should have this utility bundled with your perl distribution.

This mini tutorial assumes that you have some basic knowledge of using LWP::Simple.

Export the UserAgent

A poorly documented feature of LWP::Simple is that it supports exporting the LWP::UserAgent object it uses to fetch with.

Why would you want to do that? Well, the default timeout for LWP::Simple is the same as for LWP::UserAgent, that is 180 seconds, or three minutes. This is often way too long. In one real-life example of mine, I had a small script going live every minute, fetching something from the web. Such a timeout might mean that I have several copies of the script running simultaneously, potentially accessing the same log files or something similar. There are other ways to work around this, of course, such as setting alarms or implementing file locking. But it made no sense either way, since if the page didn't respond within 30 seconds, it was probably down anyway.

This code will take care of this problem:

    # Note that if you do this, you must explicitly
    # import everything you want to use:
    use LWP::Simple qw($ua get);
    $ua->timeout(30);
    # get returns undef on any failure, not only on a timeout
    my $html = get($webpage);
    die "Could not fetch $webpage (timed out?)" unless defined $html;

Another thing you might want to do is change your reported user agent:

    $ua->agent('My agent/1.0');

If you want to do several requests, of which the first should include a login, or something else stateful which uses cookies, you can even attach a cookie jar to use with LWP::Simple:

    use LWP::Simple qw($ua get);
    use HTTP::Cookies;
    $ua->cookie_jar(HTTP::Cookies->new);
    get($webpage . $login_string);
    my $logged_in_page = get($webpage . $private_page);

And, as usual with cookie jars, you can of course specify a file to save the cookies in between invocations of the script.
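A minimal sketch of such a file-backed cookie jar follows. The helper name file_cookie_jar and the file name cookies.txt are assumptions made for illustration; the file, autosave and ignore_discard options are the ones documented in HTTP::Cookies.

```perl
use strict;
use warnings;
use HTTP::Cookies;

# Build a cookie jar that loads from, and saves back to, a file.
# file_cookie_jar is a hypothetical helper name, not part of LWP.
sub file_cookie_jar {
    my ($path) = @_;
    return HTTP::Cookies->new(
        file           => $path,
        autosave       => 1,   # write the jar to disk when it is destroyed
        ignore_discard => 1,   # also keep session cookies, which logins often use
    );
}
```

Attach it with $ua->cookie_jar(file_cookie_jar('cookies.txt')) before the first get, and a later run of the script should pick up the stored cookies.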

As you can see, this opens up some possibilities for extra tweaking. But why not use LWP::UserAgent then, instead? Well, simply because this way is so much simpler if you only need those small extras. The corresponding LWP::UserAgent example for timeout looks like this:

    use LWP::UserAgent;
    use HTTP::Request;
    my $ua = LWP::UserAgent->new;
    $ua->timeout(30);
    my $request  = HTTP::Request->new('GET', $webpage);
    my $response = $ua->request($request);
    my $html     = $response->content;

As you can see, that is a lot more typing. See LWP::UserAgent for all the possibilities you have here.
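For comparison, here is a sketch of what the extra typing can buy you: with LWP::UserAgent the response object says why a fetch failed, which a bare get from LWP::Simple cannot do. The helper name fetch_or_explain is made up for this example.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Fetch a URL with a 30 second timeout and report why a fetch failed.
# fetch_or_explain is a hypothetical helper name, not part of LWP.
sub fetch_or_explain {
    my ($url) = @_;
    my $ua = LWP::UserAgent->new(timeout => 30);
    my $response = $ua->get($url);   # shortcut for building the GET request by hand
    return $response->decoded_content if $response->is_success;
    die "Could not fetch $url: " . $response->status_line . "\n";
}
```

If the reason for a failure never matters to your script, the LWP::Simple version remains the shorter choice.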

Update: I added the next section after getting inspiration from arunhorne's node below. I think that there is a simpler way to do this, again in some cases. It is somewhat related to the previous section, since you might use the UserAgent for this.

Use environment variables to set proxies

As pointed out by arunhorne below, it is sometimes necessary to use a proxy because you are behind a firewall. As suggested, one can always use the exported UserAgent to cope with this, by setting $ua->proxy.

But upon init, LWP::Simple will also call $ua->env_proxy, as described in LWP::UserAgent. So if you use the same script somewhere else, or run several LWP::Simple scripts, it might be easier to simply set your environment variables, like http_proxy for all HTTP requests. However, if the proxy requires credentials, I don't think that is possible to do via the environment, in which case you must resort to the UserAgent way of doing things.

This is an easy way to set your proxy, on that machine, for all eternity - without modifying the script. That may, or may not be what you want. :)

The docs on LWP::UserAgent mention how to set these on *NIX based platforms. I just want to add that on Windows, the command you want is set - try typing set /? to get some instructions. Or just set it the GUI way, which should be somewhere below the control panel.
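The two approaches can be put side by side in a short sketch. The proxy address proxy.example.com:8080 is a placeholder, and the environment is replaced inside the block only so that the example is self-contained; normally your shell would provide http_proxy.

```perl
use strict;
use warnings;
use LWP::Simple qw($ua get);
use LWP::UserAgent;

# Option 1: set the proxy on the exported user agent.
# proxy.example.com:8080 is a placeholder, not a real proxy.
$ua->proxy('http', 'http://proxy.example.com:8080/');

# Option 2: read it from the environment with env_proxy, shown here on a
# fresh user agent.
my $from_env = LWP::UserAgent->new;
{
    # Replace the environment for this block only, for demonstration.
    local %ENV = (http_proxy => 'http://proxy.example.com:8080/');
    $from_env->env_proxy;
}

print "in code:  ", $ua->proxy('http'), "\n";
print "from env: ", $from_env->proxy('http'), "\n";
```

Calling proxy with only a scheme returns the proxy currently set for it, which is a handy way to check what a script actually picked up.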

Use LWP::Simple on the command line

This is well documented in lwpcook, but it is worth mentioning. *NIX people usually have the excellent wget program to take care of this stuff. It is probably available somewhere for other platforms too, and it is included in cygwin (though not by default).

But if you know how to use LWP::Simple on the command line, and you have perl available (you do have perl on all your computers, right?) then you already know how to fetch files and pages on any platform. This is a very nice tool to have in one's toolbox.
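For fetching a file rather than printing a page, getstore is the function to reach for. Below is a small sketch written as a script instead of a one-liner; the script name fetch.pl and the helper name save_url are assumptions.

```perl
use strict;
use warnings;
use LWP::Simple qw(getstore is_success);

# Save a URL to a local file; returns true when the fetch succeeded.
# save_url is a hypothetical helper name.
sub save_url {
    my ($url, $file) = @_;
    my $status = getstore($url, $file);
    warn "Fetching $url failed with status $status\n" unless is_success($status);
    return is_success($status) ? 1 : 0;
}

# Run as: perl fetch.pl URL FILE   (fetch.pl is an assumed script name)
if (@ARGV == 2) {
    exit(save_url(@ARGV) ? 0 : 1);
}
```

This covers the PDF case from the introduction: the body goes straight to disk instead of through a variable.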

You could even use the chatterbox from the command line, using any of these (depending on if you are more fluent in XML or HTML) to read it:

    perl -MLWP::Simple -e "getprint 'http://perlmonks.org?node_id=145587'"
    perl -MLWP::Simple -e "getprint 'http://perlmonks.org?node=showchatmessages&displaytype=raw'"

...and something like this to post your own messages:

    perl -MLWP::Simple -e "get 'http://perlmonks.org?op=login&user=Dog and Pony&passwd=doNotUseThisPW&op=message&message=Hi it is me on the command line!'"

Although, for your sanity's sake, I do not really recommend it... :)

Try using get to post data into forms

Many forms out there on the web don't really need a POST request to accept your data. One good example is the regular search box at the top of the PerlMonks pages; it expects the field 'node' to contain some search words. But it doesn't care whether it gets a GET or a POST, even though the form itself uses a POST.

This code works just fine:

    my $words = 'LWP::Simple tutorial';
    my $html = get "http://www.perlmonks.org/index.pl?node=$words";

What this really comes down to is that the server can check whether a request is really a POST or not. PerlMonks has wisely chosen not to do so, thus making it much simpler for people to use this ability - not to mention that arbitrary linking such as [some words] uses this to link as best as it can. Very useful for names in the chatterbox, for instance.

The way to POST data described in lwpcook is not very hard or complex either, but this way still beats it.
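Interpolating the words straight into the URL works for simple searches, but characters like '&' or '=' in the search terms would break the query string. A sketch of a safer way to build the URL, using the URI module that LWP itself depends on (the helper name search_url is made up):

```perl
use strict;
use warnings;
use URI;

# Build a search URL with properly escaped query parameters.
# search_url is a hypothetical helper; the base URL is the one used above.
sub search_url {
    my ($words) = @_;
    my $uri = URI->new('http://www.perlmonks.org/index.pl');
    $uri->query_form(node => $words);   # escapes spaces, '&', '=' and so on
    return $uri->as_string;
}

print search_url('LWP::Simple tutorial'), "\n";
```

The result can be handed straight to get, as in get(search_url($words)).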

Use the HTTP status codes when possible

LWP::Simple also exports the HTTP::Status constants and functions, as documented. The author notes that this is a mistake and makes LWP::Simple slower, but while it is there, we should really take advantage of it for the functions that make it possible.

The functions in LWP::Simple that return an HTTP status code are getprint, getstore and mirror. This is for example the number '200' for a successful fetch, or '404' for 'Page not found', as documented in HTTP::Status. We can use these numbers to determine the success or failure of a fetch.

But it is simpler than that, unless we have special needs, as we also get the functions is_success and is_error exported. We can feed these numbers to them and get a quick answer as to whether everything is fine or not:

    my $response_code = mirror $webpage, 'webpage.html';
    die "Bad response $response_code" unless is_success($response_code);

Note: If you do the trick with exporting the UserAgent above, you will need to explicitly import these functions too.
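One thing to watch with mirror: when the local copy is already up to date, it returns 304 (RC_NOT_MODIFIED), which is_success does not count as a success. A sketch of a slightly finer-grained check follows; the helper name mirror_verdict is an assumption.

```perl
use strict;
use warnings;
use LWP::Simple;   # the default import includes the HTTP::Status helpers

# Turn a status code from mirror() into a short verdict.
# mirror_verdict is a hypothetical helper name.
sub mirror_verdict {
    my ($code) = @_;
    return 'updated'   if is_success($code);          # 2xx: a new copy was stored
    return 'unchanged' if $code == RC_NOT_MODIFIED;   # 304: local copy is current
    return 'failed: ' . (status_message($code) // "status $code")
        if is_error($code);                           # 4xx or 5xx
    return "unexpected status $code";
}
```

Used as mirror_verdict(mirror($webpage, 'webpage.html')), this lets a script that runs repeatedly treat 'unchanged' as a normal outcome instead of dying.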

Use head to determine if a site is up

This is somewhat covered in lwpcook, but it doesn't mention that this is much easier on the network traffic and the web server (if that is an issue). So if all you want to do is check if the server is responding, or if the document exists, without actually fetching it - use the function head:

    use LWP::Simple;
    print "$webpage exists and server is up!\n" if (head($webpage));

It is also worth noting that pinging the server will not tell you if the web server is up, so this is the way you want to use for this.

Of course, you also get some information in the form of a list from head if you want it. Namely Content-type, document length, last modified time, expiry date and server name, in that order.

    my @headers = head $webpage;
    # expiry date and server name are often missing (undef)
    print join "\n", map { defined $_ ? $_ : '' } @headers;

This prints the data for the webpage of your choice.
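Since the list is positional, it can help to give the values names. A minimal sketch, assuming the five-value order described above; head_info and print_head_info are hypothetical helper names.

```perl
use strict;
use warnings;
use LWP::Simple qw(head);

# Label the five values that head() returns in list context.
# Returns an empty list when the request fails.
sub head_info {
    my ($url) = @_;
    my @values = head($url) or return;
    my %info;
    @info{qw(content_type length modified expires server)} = @values;
    return %info;
}

# Print the fields, showing '(not sent)' for headers the server omitted.
sub print_head_info {
    my (%info) = @_;
    for my $field (qw(content_type length modified expires server)) {
        printf "%-12s %s\n", $field, $info{$field} // '(not sent)';
    }
}
```

A call like print_head_info(head_info($webpage)) then prints one labelled line per field, and an empty result from head_info means the HEAD request failed.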

Drawbacks

Well, none that aren't advertised in the documentation, but there are some things that one may or may not like:

  • LWP::Simple might seem limited. Well, it is, by design. Of course it would be nice to be able to do POSTs as easily, but I've noticed that I rarely actually need that, and there are still ways to do it when you do need it. LWP::Simple seems to cover most of the basic cases you stumble upon.
  • LWP::Simple pollutes the name space. Indeed it does, and that tends to be something I don't really like. If I see a subroutine call 'get', how do I know if it is mine or someone else's? This can be a problem when using someone else's code, or your own old code. You can "solve" this by documenting the call with a comment, or by always calling your own subs with a prepended '&'. LWP::Simple tends (for me) to show up in small scripts and one-liners, so then it isn't very hard to see what is going on, and it makes things much easier. It also allows you to easily use LWP::Simple on the command line.
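If the name space pollution bothers you in a larger script, one option is to import nothing and call the functions by their full names. A small sketch, in which the local get is just a stand-in for a subroutine of your own:

```perl
use strict;
use warnings;
use LWP::Simple ();   # empty import list: load the module, import nothing

# Our own get() can now coexist with the module's.
# This local get() is a hypothetical stand-in for code of your own.
sub get {
    my ($key) = @_;
    return "local value for $key";
}

# The module's function is still reachable by its full name:
#   my $html = LWP::Simple::get($url);
print get('example'), "\n";
```

This costs a bit more typing per call, so it mostly makes sense outside of one-liners.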

Final words

As you can see, there is lots and lots to gain by using LWP::Simple, and by using it right. Simple doesn't always have to mean (too) limited. I hope this has been a help in your web programming and/or automation tasks - sometimes, simple is all it takes.


In reply to Getting more out of LWP::Simple by Dog and Pony
