ColonelPanic has asked for the wisdom of the Perl Monks concerning the following question:

I have a CGI script that needs to occasionally grab an HTML page. Unfortunately, the (linux) server it is running on does not have LWP installed, and I can't install modules. (Not easily at least.) Besides, I want this script to be easily portable to other servers if possible. What is the easiest way to get an HTML page? I have also tried this:
    $page = `lynx -source www.address.com`;
    #--or--
    system('lynx -source www.address.com -o file.txt');
This doesn't seem to work. In fact, no system commands seem to work. Thanks for your help

Replies are listed 'Best First'.
Re: getting HTML source
by c-era (Curate) on Jan 05, 2001 at 00:30 UTC
    If you cannot get LWP, here is a way that will work.
    NOTE: This code is only a very basic way to get the HTML from a web page. I do not recommend using it if you can possibly avoid it, but it will work fine if all you want to do is grab the HTML and save it to a file.
    use strict;
    use warnings;
    use IO::Socket;

    my $host = "www.perlmonks.com";
    my $port = 80;
    # The path is everything after the host name (including the first /)
    my $path = "/";

    # The request must end with a blank line, hence the two line endings
    my $get = "GET $path HTTP/1.0\r\n\r\n";

    # Open our socket
    my $http_request = IO::Socket::INET->new(
        Proto    => "tcp",
        PeerAddr => $host,
        PeerPort => $port,
    ) or die "Cannot connect to $host:$port: $!";

    # Write our request
    print $http_request $get;

    # Grab the output (status line and headers included)
    my @html = <$http_request>;
    close $http_request;

    # Do something with the output
    print @html;
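One thing to keep in mind with the socket approach: what comes back is the whole HTTP response, so the status line and headers arrive in front of the HTML. Here is a rough sketch of how the two could be separated. The helper name split_http_response is made up for illustration, and the example feeds it a canned response rather than a live socket.

```perl
use strict;
use warnings;

# Hypothetical helper: split a raw HTTP response into the status code,
# a hash of headers (names lower-cased), and the body.
sub split_http_response {
    my ($raw) = @_;

    # Headers end at the first blank line; accept \n as well as \r\n.
    my ($head, $body) = split /\r?\n\r?\n/, $raw, 2;
    $head = '' unless defined $head;
    $body = '' unless defined $body;

    my @lines = split /\r?\n/, $head;
    my $status_line = shift @lines;
    $status_line = '' unless defined $status_line;

    # The status line looks like "HTTP/1.0 200 OK".
    my ($code) = $status_line =~ m{^HTTP/\d+\.\d+\s+(\d{3})};

    my %headers;
    for my $line (@lines) {
        my ($name, $value) = $line =~ /^([^:]+):\s*(.*)$/ or next;
        $headers{ lc $name } = $value;
    }

    return ($code, \%headers, $body);
}

# Example with a canned response instead of a live connection
my $raw = "HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>\n";
my ($code, $headers, $body) = split_http_response($raw);
print "status: $code\n";
print "type:   $headers->{'content-type'}\n";
print $body;
```

With the code above you would pass in join('', @html). Checking the status code before saving the body avoids writing an error page to your file.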
Re: getting HTML source
by ichimunki (Priest) on Jan 05, 2001 at 00:07 UTC
    Have you tried installing LWP to a local directory and adding use lib './path/to/local/modules/'; to your script? This way your code can still get at the module and you do not affect the site install of perl. I can't think of a way to solve this that isn't a coding nightmare or dependent on system calls that function. I like turnstep's suggestion re: absolute paths.

    note: this post has been modified.
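A minimal sketch of the local-directory idea, assuming the modules were unpacked into a lib directory sitting next to the script (that directory name is only an example). FindBin is used so the path does not depend on the directory the web server happens to start the script from.

```perl
use strict;
use warnings;
use FindBin qw($Bin);    # $Bin is the directory containing the running script
use lib "$Bin/lib";      # hypothetical location of the private module tree

# Confirm that the private directory is now part of the module search path.
if (grep { $_ eq "$Bin/lib" } @INC) {
    print "module search path includes $Bin/lib\n";
}
```

After this, a plain use or require of a module will look in that directory as well as in the site-wide ones.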
Re: getting HTML source
by turnstep (Parson) on Jan 05, 2001 at 00:02 UTC

    And, as always, make sure that you give a complete path to your system commands (e.g. lynx) so the CGI script knows where it is. I recommend putting them all at the top of your script for easy maintenance:

    #!/bin/perl
    # -- major pieces missing, but you get the picture
    use CGI;
    my $lynx = "/usr/bin/foo/bar/lynx";
    my $wget = "/export/home/genghis/bin/wget";
    ## etc..

    Same goes for that output file in your example (file.txt) - make sure you are giving an absolute path, as you cannot be sure from what directory lynx is being invoked.

    Heck, as a general rule of thumb, use absolute paths for everything - it will save a lot of trouble in the long run. :)
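To make the absolute-path advice concrete, here is a rough sketch of running an external program without going through the shell and checking whether it actually succeeded. The helper name run_capture is invented for this example, and the demo runs the current perl interpreter ($^X) as a stand-in so it works anywhere; in the real script the command would be something like an absolute path to lynx followed by '-source' and the URL.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command given as a list (no shell, no PATH
# lookup surprises) and return its output together with its exit status.
sub run_capture {
    my @cmd = @_;
    open my $out, '-|', @cmd or die "cannot run $cmd[0]: $!";
    local $/;                       # slurp everything in one read
    my $output = <$out>;
    $output = '' unless defined $output;
    close $out;                     # close sets $? to the child's wait status
    return ($output, $? >> 8);
}

# Stand-in command: the perl binary that is running this script
my ($text, $status) = run_capture($^X, '-e', 'print "hello"');
print "exit status $status, output: $text\n";
```

A non-zero status, or a die from the open, tells you right away that the program was not found or failed, instead of silently getting an empty $page.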

      Well, I'm amazed I'm actually saying this about a turnstep post, but I have to add something to any advice that favors hardcoded absolute file paths. For code reuse and maintenance, I've developed the general rule of storing all absolute paths in an external file (.cnf, .que, etc.) and grabbing them from there. Making one change in an external file is much more economical and portable.
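A sketch of what reading paths from an external file might look like. The file format here (one "name = /absolute/path" pair per line, with # comments) and the name read_paths are assumptions for illustration, not a standard. The example reads from an in-memory string so it runs as-is; a real script would open its own config file instead.

```perl
use strict;
use warnings;

# Hypothetical format: "name = /absolute/path", one per line.
# Blank lines and lines starting with "#" are ignored.
sub read_paths {
    my ($fh) = @_;
    my %path;
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/\r$//;                      # tolerate DOS line endings
        next if $line =~ /^\s*(?:#|$)/;
        my ($name, $value) = $line =~ /^\s*(\w+)\s*=\s*(.*?)\s*$/ or next;
        $path{$name} = $value;
    }
    return \%path;
}

# Example using an in-memory file (needs Perl 5.8 or later)
my $config = "# external programs\nlynx = /usr/bin/lynx\nwget = /usr/local/bin/wget\n";
open my $fh, '<', \$config or die "cannot open in-memory config: $!";
my $paths = read_paths($fh);
close $fh;
print "$_ => $paths->{$_}\n" for sort keys %$paths;
```

Moving the script to another server then means editing that one file rather than the code.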

Re: getting HTML source
by dws (Chancellor) on Jan 05, 2001 at 00:49 UTC
Re: getting HTML source
by Fastolfe (Vicar) on Jan 05, 2001 at 01:00 UTC
    I fail to see how not installing LWP will make your script more portable. Bundle LWP with your script when you distribute it if necessary. A reliance on things like 'lynx' is precisely what we're trying to avoid by using modules in the first place.

    Your problem with your two system calls is probably because 'lynx' isn't in the script's path. Specify an absolute filename to the program or update your PATH accordingly.

    If you "can't" install a module, you might just consider using IO::Socket and requesting the page by hand using Perl's native network support. That would be the "most" portable solution, but obviously you're never going to code yourself a solution that's as robust as using LWP in the first place would be.

      Thanks for your help. I think that indeed was my problem with the calls to lynx. However, I have since abandoned that method precisely because of nonportability. The script now looks for LWP, and if it isn't found uses IO::Socket. It works fine for my purposes. Installing the module would be my first choice, but my primary goal is ease of installation. It has to be as easy to install as possible by people who don't know anything about Perl, and have as few files as possible.
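For anyone reading later, the "look for LWP, fall back to IO::Socket" approach can be sketched roughly like this. This is not the actual script, only an illustration of the idea; the names have_module and fetch_page are made up, and the fallback does no redirect or error handling beyond returning nothing on failure.

```perl
use strict;
use warnings;

# True if the named module can be loaded on this system.
sub have_module {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;
    return eval { require $file; 1 } ? 1 : 0;
}

# Hypothetical fetcher: LWP::Simple when available, a raw socket otherwise.
sub fetch_page {
    my ($host, $path) = @_;

    if (have_module('LWP::Simple')) {
        return LWP::Simple::get("http://$host$path");
    }

    require IO::Socket::INET;
    my $sock = IO::Socket::INET->new(
        Proto    => 'tcp',
        PeerAddr => $host,
        PeerPort => 80,
    ) or return;
    print $sock "GET $path HTTP/1.0\r\nHost: $host\r\n\r\n";
    local $/;                       # slurp the whole response
    my $response = <$sock>;
    close $sock;
    return unless defined $response;

    # Drop the status line and headers; keep only the body.
    my (undef, $body) = split /\r?\n\r?\n/, $response, 2;
    return $body;
}

print "LWP::Simple is ",
    (have_module('LWP::Simple') ? "available" : "not available"), "\n";
```

A call would look like my $html = fetch_page('www.perlmonks.com', '/'); and an undefined result means the fetch failed either way.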
Re: getting HTML source
by kschwab (Vicar) on Jan 04, 2001 at 23:53 UTC
    LWP was certainly my first thought. I'm not clear on why you can't install a module. In any case, have you tried: lynx -source http://www.address.com/ ? I suspect lynx wants a real url, not just a hostname.
Re: getting HTML source
by EvanK (Chaplain) on Jan 04, 2001 at 23:51 UTC
    well, you could always do it the harder (but sure to work) way: using sockets, (the server you're on is bound to have some sort of socket module) connect to the server you wanna grab the html from (remember to connect on port 80), request the html file (which you could then redirect to a variable or a file on your own server), and close the socket connection.
    Then do whatever you need with the html...that oughtta work.

    ______________________________________________
    It's hard to believe that everyone here is the result of the smartest sperm.

Re: getting HTML source
by ColonelPanic (Friar) on Jan 05, 2001 at 00:55 UTC
    Thanks everyone for your help. Hopefully I can get it working now.