Re: getting HTML source
by c-era (Curate) on Jan 05, 2001 at 00:30 UTC
|
If you cannot get LWP, here is a way that will work. NOTE: This code is only a very basic way to get the HTML from a web page. I do not recommend using it if you can avoid it, but it will work fine if all you want to do is grab the HTML and save it to a file.
use IO::Socket;
my $host = "www.perlmonks.com";
my $port = 80;
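my $path = "/"; # The path is everything after the host name (including the first /)
# The request must end with a blank line, i.e. two line terminators in a row.
# The Host header is optional in HTTP/1.0 but needed by virtual-hosted servers.
my $get = "GET $path HTTP/1.0\r\nHost: $host\r\n\r\n";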
# Open our socket
my $http_request = IO::Socket::INET->new(
Proto => "tcp",
PeerAddr => $host,
PeerPort => $port,
) or die $!;
# Write our request
print $http_request $get;
# Grab the output
my @html = <$http_request>;
# Do something with the output
print @html;
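Note that the raw socket output above contains the HTTP status line and headers as well as the HTML. A minimal sketch of separating them before saving, assuming the whole response is already in memory; `split_http_response` and `save_body` are hypothetical helper names, not part of any module:

```perl
use strict;
use warnings;

# Split a raw HTTP response into status code, header lines, and body.
# The header block ends at the first blank line.
sub split_http_response {
    my ($raw) = @_;
    $raw = '' unless defined $raw;
    my ($head, $body) = split /\r?\n\r?\n/, $raw, 2;
    $head = '' unless defined $head;
    $body = '' unless defined $body;
    my ($status_line, @header_lines) = split /\r?\n/, $head;
    my ($code) = defined $status_line
        ? $status_line =~ m{^HTTP/\d\.\d\s+(\d{3})}
        : ();
    return ($code, \@header_lines, $body);
}

# Write the body to a file (hypothetical helper).
sub save_body {
    my ($file, $body) = @_;
    open my $fh, '>', $file or die "Cannot write $file: $!";
    binmode $fh;
    print {$fh} $body;
    close $fh or die "Cannot close $file: $!";
}
```

With the code above, this could be called as `my ($code, $headers, $body) = split_http_response(join '', @html);` followed by `save_body('/absolute/path/to/file.txt', $body)` when `$code` is 200.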
Re: getting HTML source
by ichimunki (Priest) on Jan 05, 2001 at 00:07 UTC
|
Have you tried installing LWP to a local directory and adding use lib './path/to/local/modules/'; to your script? This way your code can still get at the module and you do not affect the site install of Perl. I can't think of a way to solve this that isn't a coding nightmare or dependent on system calls that happen to work. I like turnstep's suggestion re: absolute paths.
note: this post has been modified.
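A rough sketch of the local-directory approach, assuming the modules are unpacked into a `lib` directory next to the script (that layout is an assumption, not something from the original question). LWP depends on other modules such as URI, so those would need to sit in the same directory too:

```perl
use strict;
use warnings;
use FindBin qw($Bin);   # $Bin is the directory containing the running script

# Hypothetical layout: modules live in "lib" beside the script,
# e.g. lib/LWP/Simple.pm. Using $Bin avoids depending on the
# current working directory.
use lib "$Bin/lib";

# Check at run time whether LWP::Simple can be loaded from @INC.
my $have_lwp = eval { require LWP::Simple; 1 } ? 1 : 0;
print $have_lwp ? "LWP::Simple is available\n" : "LWP::Simple not found\n";
```

Anchoring the path on `$Bin` rather than `./` matters for CGI scripts, where the working directory is not guaranteed.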
Re: getting HTML source
by turnstep (Parson) on Jan 05, 2001 at 00:02 UTC
|
And, as always, make sure that you give a complete path
to your system commands (e.g. lynx) so the CGI script knows where it
is. I recommend putting them all at the top of your script
for easy maintenance:
#!/bin/perl
# Major pieces missing, but you get the picture
use CGI;
my $lynx = "/usr/bin/foo/bar/lynx";
my $wget = "/export/home/genghis/bin/wget";
##etc..
Same goes for that output file in your example (file.txt) -
make sure you are giving an absolute path, as you cannot be
sure from what directory lynx is being invoked.
Heck, as a general rule of thumb, use absolute paths
for everything - it will save a lot of trouble in the long
run. :)
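Here is one way the absolute-path advice might look in practice. The locations in `$LYNX` and `$OUTFILE` are assumptions for illustration, and `fetch_with_lynx` is a hypothetical helper; `-source` is the lynx option mentioned elsewhere in this thread:

```perl
use strict;
use warnings;
use File::Spec;

# Assumed locations; adjust for your system.
my $LYNX    = '/usr/bin/lynx';
my $OUTFILE = '/tmp/file.txt';

# Run "lynx -source URL" and write its output to $outfile.
sub fetch_with_lynx {
    my ($url, $outfile) = @_;

    # Refuse relative paths up front.
    for my $path ($LYNX, $outfile) {
        die "Not an absolute path: $path\n"
            unless File::Spec->file_name_is_absolute($path);
    }
    die "Cannot execute $LYNX\n" unless -x $LYNX;

    # List-form pipe open: no shell is involved, so $url is passed as-is.
    open my $in, '-|', $LYNX, '-source', $url
        or die "Cannot run $LYNX: $!\n";
    open my $out, '>', $outfile
        or die "Cannot write $outfile: $!\n";
    print {$out} $_ while <$in>;
    close $in  or die "lynx failed (exit status $?)\n";
    close $out or die "Cannot close $outfile: $!\n";
    return 1;
}
```

A call would look like `fetch_with_lynx('http://www.perlmonks.com/', $OUTFILE);`.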
|
|
Well, I'm amazed I'm actually saying this about a turnstep post, but I have to add a caveat to any advice in favor of hardcoded absolute file paths. For code reuse and maintenance, I've developed the general rule of storing all absolute paths in an external file (.cnf, .que, etc.) and grabbing them from there. Making one change in an external file is much more economical and portable.
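As an illustration of the external-file idea, here is a small sketch that reads `name = value` lines into a hash. The file format and the helper name `read_paths` are assumptions made for this example, not a description of any particular setup:

```perl
use strict;
use warnings;

# Read "name = /absolute/path" lines from a config file.
# Blank lines and lines starting with # are ignored.
sub read_paths {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot read $file: $!\n";
    my %path;
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;
        next if $line =~ /^\s*(?:#|$)/;
        my ($name, $value) = $line =~ /^\s*(\w+)\s*=\s*(.*?)\s*$/
            or die "Bad line $. in $file: $line\n";
        $path{$name} = $value;
    }
    close $fh;
    return \%path;
}
```

A script would then use something like `my $paths = read_paths('/absolute/path/to/paths.cnf'); my $lynx = $paths->{lynx};`, so moving a program means editing only the config file.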
Re: getting HTML source
by dws (Chancellor) on Jan 05, 2001 at 00:49 UTC
|
Re: getting HTML source
by Fastolfe (Vicar) on Jan 05, 2001 at 01:00 UTC
|
I fail to see how not installing LWP will make your script more portable. Bundle LWP with your script when you distribute it if necessary. A reliance on things like 'lynx' is precisely what we're trying to avoid by using modules in the first place.
Your problem with your two system calls is probably because 'lynx' isn't in the script's path. Specify an absolute filename to the program or update your PATH accordingly.
If you "can't" install a module, you might consider using IO::Socket and requesting the page by hand using Perl's native network support. That would be the "most" portable solution, but obviously you're never going to code yourself a solution that's as robust as LWP.
|
|
Thanks for your help. I think that indeed was my problem with the calls to lynx. However, I have since abandoned that method precisely because of nonportability. The script now looks for LWP, and if it isn't found uses IO::Socket. It works fine for my purposes. Installing the module would be my first choice, but my primary goal is ease of installation. It has to be as easy to install as possible by people who don't know anything about Perl, and have as few files as possible.
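The "look for LWP, otherwise use IO::Socket" approach could be sketched roughly as follows. This is not the poster's actual script; `get_html` and `fetch_raw_socket` are hypothetical names, and the socket branch handles only plain HTTP with no redirects or error checking on the status code:

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Try to load LWP::Simple at run time; remember whether it worked.
my $HAVE_LWP = eval { require LWP::Simple; 1 } ? 1 : 0;

# Fallback: send a bare HTTP/1.0 request and return the response body.
sub fetch_raw_socket {
    my ($host, $path, $port) = @_;
    $port ||= 80;
    my $sock = IO::Socket::INET->new(
        Proto    => 'tcp',
        PeerAddr => $host,
        PeerPort => $port,
        Timeout  => 10,
    ) or die "Cannot connect to $host:$port: $!\n";
    print {$sock} "GET $path HTTP/1.0\r\nHost: $host\r\n\r\n";
    local $/;                       # slurp the whole response
    my $response = <$sock>;
    close $sock;
    $response = '' unless defined $response;
    # Drop the status line and headers; keep what follows the blank line.
    my (undef, $body) = split /\r?\n\r?\n/, $response, 2;
    return $body;
}

# Use LWP when present, otherwise the socket fallback.
sub get_html {
    my ($host, $path) = @_;
    return LWP::Simple::get("http://$host$path") if $HAVE_LWP;
    return fetch_raw_socket($host, $path);
}
```

A call such as `my $html = get_html('www.perlmonks.com', '/');` returns undef if the fetch fails or the response has no body.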
Re: getting HTML source
by kschwab (Vicar) on Jan 04, 2001 at 23:53 UTC
|
LWP was certainly my first thought.
I'm not clear on why you can't install a module.
In any case, have you tried:
lynx -source http://www.address.com/ ?
I suspect lynx wants a real url, not just a hostname.
Re: getting HTML source
by EvanK (Chaplain) on Jan 04, 2001 at 23:51 UTC
|
well, you could always do it the harder (but sure to work) way: using sockets, (the server you're on is bound to have some sort of socket module) connect to the server you wanna grab the html from (remember to connect on port 80), request the html file (which you could then redirect to a variable or a file on your own server), and close the socket connection. Then do whatever you need with the html...that oughtta work.
______________________________________________
It's hard to believe that everyone here is the result
of the smartest sperm.
Re: getting HTML source
by ColonelPanic (Friar) on Jan 05, 2001 at 00:55 UTC
|
Thanks everyone for your help. Hopefully I can get it working now.