getting HTML source

ColonelPanic has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: getting HTML source by c-era (Curate) on Jan 05, 2001 at 00:30 UTC
If you cannot get LWP, here is a way that will work. NOTE: This code is only a very basic way to get the HTML from a web page. I do not recommend using it if you can possibly avoid it, but it will work fine if all you want to do is grab the HTML and save it to a file.
`use IO::Socket;

my $host = "www.perlmonks.com";
my $port = 80;
# The path is everything after the host name (including the first /)
my $path = "/";

# The request must end with a blank line (two newlines in a row)
my $get = <<EOF;
GET $path HTTP/1.0

EOF

# Open our socket (IO::Socket::INET reports connection errors in $@)
my $http_request = IO::Socket::INET->new(
    Proto    => "tcp",
    PeerAddr => $host,
    PeerPort => $port,
) or die "Cannot connect to $host:$port: $@";

# Write our request
print $http_request $get;

# Grab the output (response headers followed by the HTML)
my @html = <$http_request>;

# Do something with the output
print @html;`
Re: getting HTML source by ichimunki (Priest) on Jan 05, 2001 at 00:07 UTC
Have you tried installing LWP to a local directory and adding `use lib './path/to/local/modules/';` to your script? This way your code can still get at the module and you do not affect the site install of Perl. I can't think of a way to solve this that isn't a coding nightmare or dependent on system calls that happen to work. I like turnstep's suggestion re: absolute paths. note: this post has been modified.
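For what it's worth, here is a rough sketch of the local-directory approach. It assumes LWP has been unpacked into a directory called `modules` next to the script (that name, and the `fetch_page` helper, are made up for illustration). FindBin is used so the path is resolved relative to the script rather than to whatever directory it happens to be run from.

```perl
use strict;
use warnings;
use FindBin qw($Bin);      # $Bin is the directory containing this script
use lib "$Bin/modules";    # hypothetical local module directory

# Hypothetical helper: load LWP::Simple at run time and fetch one page.
# Dies with a readable message if the module or the page is unavailable.
sub fetch_page {
    my ($url) = @_;
    require LWP::Simple;
    my $html = LWP::Simple::get($url);
    die "Could not fetch $url\n" unless defined $html;
    return $html;
}
```

Something like `print fetch_page("http://www.perlmonks.com/");` would then pick up the bundled copy of LWP if no site-wide one is installed.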
Re: getting HTML source by turnstep (Parson) on Jan 05, 2001 at 00:02 UTC
And, as always, make sure that you give a complete path to your system commands (e.g. lynx) so the CGI script knows where they are. I recommend putting them all at the top of your script for easy maintenance:
`#!/bin/perl
# major pieces missing, but you get the picture
use CGI;

my $lynx = "/usr/bin/foo/bar/lynx";
my $wget = "/export/home/genghis/bin/wget";
## etc.`
The same goes for the output file in your example (file.txt): make sure you are giving an absolute path, as you cannot be sure from what directory lynx is being invoked. Heck, as a general rule of thumb, use absolute paths for everything. It will save a lot of trouble in the long run. :)
Re (Turnstep):Re: getting HTML source by coreolyn (Parson) on Jan 05, 2001 at 00:23 UTC
Well, I'm amazed I'm actually saying this about a turnstep post, but I've got to add something to any advice that favors hardcoded absolute file paths. For code reuse and maintenance, I've developed the general rule of storing all absolute paths in an external file (.cnf, .que, etc.) and grabbing them from there. Making one change in an external file is much more economical and portable.
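A minimal sketch of that idea, assuming a plain `name = /absolute/path` file format (the format and the `load_paths` name are my own invention here, not anything standard):

```perl
use strict;
use warnings;

# Read "name = /absolute/path" pairs from a config file into a hash ref.
# Blank lines and lines starting with '#' are skipped.
sub load_paths {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!\n";
    my %path;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*(?:#|$)/;
        my ($name, $value) = $line =~ /^\s*(\w+)\s*=\s*(.*?)\s*$/
            or die "Bad line in $file: $line\n";
        $path{$name} = $value;
    }
    close $fh;
    return \%path;
}
```

With a file containing a line such as `lynx = /usr/bin/foo/bar/lynx`, the script would call `my $paths = load_paths($file);` and then use `$paths->{lynx}` instead of a hardcoded string.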
Re: getting HTML source by dws (Chancellor) on Jan 05, 2001 at 00:49 UTC
This issue was covered a while back in Getting external HTML and Grabbing a web page without LWP or the like.
Re: getting HTML source by Fastolfe (Vicar) on Jan 05, 2001 at 01:00 UTC
I fail to see how not installing LWP will make your script more portable. Bundle LWP with your script when you distribute it if necessary. A reliance on things like 'lynx' is precisely what we're trying to avoid by using modules in the first place. Your problem with your two system calls is probably because 'lynx' isn't in the script's path. Specify an absolute filename to the program or update your PATH accordingly. If you "can't" install a module, you might just consider using IO::Socket and request the page by hand using Perl's native network support. That would be the "most" portable solution, but obviously you're never going to code yourself a solution that's as robust as using LWP in the first place will be.
Re: Re: getting HTML source by ColonelPanic (Friar) on Jan 06, 2001 at 02:22 UTC
Thanks for your help. I think that indeed was my problem with the calls to lynx. However, I have since abandoned that method precisely because of nonportability. The script now looks for LWP, and if it isn't found uses IO::Socket. It works fine for my purposes. Installing the module would be my first choice, but my primary goal is ease of installation. It has to be as easy to install as possible by people who don't know anything about Perl, and have as few files as possible.
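For anyone who finds this later, the "look for LWP, otherwise use IO::Socket" idea can be sketched roughly as below. This is not the actual script; the sub names are hypothetical, and the fallback only handles plain HTTP on port 80 with no redirects.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# True (1) if LWP::Simple can be loaded on this system, otherwise 0.
sub have_lwp {
    return eval { require LWP::Simple; 1 } ? 1 : 0;
}

# Drop the status line and headers, keeping only the response body.
sub strip_headers {
    my ($response) = @_;
    my (undef, $body) = split /\r?\n\r?\n/, $response, 2;
    return defined $body ? $body : '';
}

# Fallback: send a bare HTTP/1.0 request over a socket and return the body.
sub fetch_with_socket {
    my ($host, $path) = @_;
    my $sock = IO::Socket::INET->new(
        Proto    => 'tcp',
        PeerAddr => $host,
        PeerPort => 80,
    ) or die "Cannot connect to $host: $@\n";
    print $sock "GET $path HTTP/1.0\r\nHost: $host\r\n\r\n";
    local $/;                      # read the whole response at once
    my $response = <$sock>;
    close $sock;
    return strip_headers(defined $response ? $response : '');
}

# Use LWP when it is available, otherwise fall back to the socket version.
sub fetch {
    my ($host, $path) = @_;
    return have_lwp()
        ? LWP::Simple::get("http://$host$path")
        : fetch_with_socket($host, $path);
}
```

A call such as `my $html = fetch("www.perlmonks.com", "/");` then works either way, though only the LWP branch copes with redirects and the like.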
Re: getting HTML source by kschwab (Vicar) on Jan 04, 2001 at 23:53 UTC
LWP was certainly my first thought. I'm not clear on why you can't install a module. In any case, have you tried: lynx -source http://www.address.com/ ? I suspect lynx wants a real URL, not just a hostname.
Re: getting HTML source by EvanK (Chaplain) on Jan 04, 2001 at 23:51 UTC
Well, you could always do it the harder (but sure to work) way: using sockets (the server you're on is bound to have some sort of socket module), connect to the server you want to grab the HTML from (remember to connect on port 80), request the HTML file (which you could then store in a variable or in a file on your own server), and close the socket connection. Then do whatever you need with the HTML. That ought to work.
Re: getting HTML source by ColonelPanic (Friar) on Jan 05, 2001 at 00:55 UTC
Thanks everyone for your help. Hopefully I can get it working now.