PerlMonks  

Fetching HTML Pages with Sockets

by amt (Monk)
on Sep 19, 2004 at 18:15 UTC ( [id://392209] )

amt has asked for the wisdom of the Perl Monks concerning the following question:

Gentlemen,
I am attempting to retrieve a web page from a server using sockets and HTTP commands, due to limitations of my server.
Here is my code segment:
use Socket;
use CGI;

my $page = connect_host();
print "$page";

sub connect_host {
    my $remote = gethostbyname("football.fantasysports.yahoo.com");
    my $proto  = getprotobyname('tcp');
    my $port   = 80;
    my $remote_host = sockaddr_in($port, $remote);
    socket(SOCK, PF_INET, SOCK_STREAM, $proto);
    connect(SOCK, $remote_host);
    print SOCK "GET / HTTP/1.0\r\n\r\n";
    my $html = <SOCK>;
    return $html;
}


When executing the code, I receive the following error:
Use of uninitialized value in string at ./pcs.pl line 16.

Thanks in advance.

amt

Replies are listed 'Best First'.
Re: Fetching HTML Pages with Sockets
by ikegami (Patriarch) on Sep 19, 2004 at 18:47 UTC

    That code works for me without any warnings, although it would hang until I called flush(SOCK) after the print, where flush is defined as
    sub flush { select($_[0]); my $t=$|; $|=1; $|=$t; }.

    All of the functions you called above return undef on error, and set $! to an error number and "$!" to an error message. Why don't you check for errors?

    btw, LWP comes with perl. You probably should be using that. For starters, it knows that "\r\n" will not always do what you want. Use "\xD\xA" instead of "\r\n". Also, you should do local *SOCK to limit the scope of SOCK.
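    Putting those suggestions together, an error-checked version might look something like this (an untested sketch; build_request is a made-up helper, and the host and path are only placeholders):

```perl
use strict;
use warnings;
use Socket;

# Hypothetical helper (not from the original post): build a minimal
# HTTP/1.0 request using explicit CRLF bytes rather than "\r\n".
sub build_request {
    my ($host, $path) = @_;
    return "GET $path HTTP/1.0\015\012"
         . "Host: $host\015\012\015\012";
}

sub fetch_page {
    my ($host, $path) = @_;
    my $remote = gethostbyname($host)
        or die "gethostbyname failed for $host: $!";
    my $proto = getprotobyname('tcp');
    my $paddr = sockaddr_in(80, $remote);

    local *SOCK;    # limit the scope of the bareword handle
    socket(SOCK, PF_INET, SOCK_STREAM, $proto) or die "socket: $!";
    connect(SOCK, $paddr)                      or die "connect: $!";

    # Unbuffer the socket so the request is actually sent.
    select((select(SOCK), $| = 1)[0]);

    print SOCK build_request($host, $path);

    # <SOCK> in scalar context reads a single line, which is why the
    # original returned only one line; slurp the whole response instead.
    my $response = do { local $/; <SOCK> };
    close SOCK;
    return $response;
}

# e.g. print fetch_page('football.fantasysports.yahoo.com', '/');
```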

      Unfortunately, when I tried to use LWP::UserAgent it was not included in @INC, and getting my sysadmin to install anything, well, won't happen.
      When using the hex values, I get barked at for an invalid hex value.
      amt
        Odd, it works for me in perl 5.6.1 and 5.8.0. How about octal? "\015\012"
Re: Fetching HTML Pages with Sockets
by davido (Cardinal) on Sep 20, 2004 at 03:05 UTC

    I just thought I would mention this...

    LWP::Simple is a pure-Perl module, with only one dependency: HTTP::Status. That too is a pure-Perl module, with no dependencies. So both could be installed to a ~/lib path within your own user path, without sysadmin intervention, and without tricky compilation steps. It's pretty easy.

    After installing them, you would only need to adjust @INC so that perl can find where you've put those modules.

    Doing so would allow you to get pages via HTTP as easily as my $page = get('http://www.perlmonks.org');
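    In other words, something like the following (a minimal sketch, assuming the two modules have been copied under ~/lib, e.g. as ~/lib/LWP/Simple.pm):

```perl
use strict;
use warnings;
use lib "$ENV{HOME}/lib";    # prepend your private library path to @INC

# With the modules in place, fetching a page is one call.
# require/eval is used here only so the snippet degrades gracefully
# when the module has not been installed yet.
if (eval { require LWP::Simple; 1 }) {
    my $page = LWP::Simple::get('http://www.perlmonks.org');
    print $page if defined $page;
}
else {
    warn "LWP::Simple not found in \@INC\n";
}
```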


    Dave

      I would also add that Win32 Perl installations have "get.bat" in the /Perl/bin dir, so you can "get" pages from the command line, or you can write simple one-liners to fetch one or many pages or binaries :-)
      perl -le "`get http://www.perlmonks.com/index.pl?parent=392273;node_id=3333 > node_$_.htm` for qw /3333/;"

      This example fetches this node and the one above it.
      JamesNC
Re: Fetching HTML Pages with Sockets
by tachyon (Chancellor) on Sep 19, 2004 at 23:45 UTC

    Your code won't work with many servers, as they require a Host: header.

    sub socket_get {
        my ( $url, $cookies ) = @_;
        return ( "HTTP/1.0 500 Invalid URL $url", '' )
            unless $url =~ m!^http://([^/:\@]+)(?::(\d+))?(/\S*)?$!;
        my $host = $1;
        my $port = $2 || 80;
        my $path = $3;
        $path = "/" unless defined $path;
        require IO::Socket::INET;
        my $sock = IO::Socket::INET->new(
            PeerAddr => $host,
            Proto    => 'tcp',
            PeerPort => $port,
        ) or return ( "HTTP/1.0 500 Could not connect socket: $url", '' );
        $sock->autoflush;
        my $netloc = $host;
        $netloc .= ":$port" if $port != 80;
        my $cookie_str = '';
        if ( $cookies and ref($cookies) eq 'ARRAY' ) {
            $cookie_str .= "Cookie: $_\015\012" for @$cookies;
        }
        my $req = join '',
            "GET $path HTTP/1.0\015\012",
            "Host: $netloc\015\012",
            "Accept: */*\015\012",
            "Accept-Encoding: *\015\012",
            $cookie_str,
            "User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; )\015\012\015\012";
        syswrite( $sock, $req );
        my ( $header, $content, $buffer ) = ( '', '', '' );
        $content .= $buffer while sysread( $sock, $buffer, 8192 );
        $sock->close;
        return ( "HTTP/1.0 500 Failed to get any content for $url", '' )
            unless $content;
        ( $header, $content ) = split /\015\012\015\012|\012\012|\015\015/, $content, 2;
        return ( "HTTP/1.0 500 Failed to get a header for $url", '' )
            unless $header;
        # unfold the header
        $header =~ s/\015\012/\n/g;
        $header =~ s/\n\s+/ /g;
        $content ||= '';
        return ( $header, $content );
    }

    cheers

    tachyon

Re: Fetching HTML Pages with Sockets
by zentara (Archbishop) on Sep 20, 2004 at 13:29 UTC
    Here is one, which is easy to understand. (I didn't write it, but it works fine).
    #!/usr/bin/perl
    # Very simple client program to fetch
    # pages from specified Web sites.
    require 5.002;
    use strict;
    use Socket;

    # Perl 5 technique for declaring local variables.
    my ( $host, $in_addr, $proto, $port, $addr );
    my ( $response, $page, $file, $pattern, %urls );

    # Set up some URLs in an array
    my @pages = ( "zentara.net/~zentara/poems.html", "zentara.net" );

    foreach $page (@pages) {
        ( $host, $file ) = split /\//, $page, 2;

        # Form the HTTP server address from the host
        # name and port number
        $in_addr = ( gethostbyname($host) )[4];
        $port    = 80;
        $addr    = sockaddr_in( $port, $in_addr );
        $proto   = getprotobyname('tcp');

        # Create an Internet protocol socket.
        socket( S, AF_INET, SOCK_STREAM, $proto ) or die "socket:$!";

        # Connect our socket to the server socket.
        connect( S, $addr ) or die "connect:$!";

        # Force a flush on the socket file handle after every write.
        select(S); $| = 1; select(STDOUT);

        # Send get request to server.
        print S "GET /$file HTTP/1.0\n\n";
        print "===================$page===========================\n";

        # Print the returned HTML.
        while (<S>) { print; }
        close(S);
    }
    exit;

    I'm not really a human, but I play one on earth. flash japh
      Thanks for posting that script. I've been experimenting with sockets, lately, but strictly in the realm of our lan. I had to give permission to the firewall to let me through, but once I did this worked nicely. Question: are there any security issues involved in fetching a page in this way? Just want to make sure whether I'm playing with fire, or just scrabbling in the dirt as I usually do.
        I can't think of any security issues that would arise from pulling files down using a socket and HTTP directives, but keep in mind that if the sockets are not set up properly, you may leave ports open. Making sure that you close the sockets explicitly is always a good measure.

        Also be sure to run perl with the Taint option if you plan on using the output from a remote location as the input on your script.
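        For instance, the untainting idiom usually looks like this (a minimal sketch; the $remote_data literal just stands in for bytes read off the socket, which perl would mark as tainted when run with -T):

```perl
# Run as: perl -T script.pl
use strict;
use warnings;

my $remote_data = "index.html";   # stand-in for data read from the socket

# Under taint mode, the only way to untaint data is to extract
# the part you trust with a regex capture.
my ($safe) = $remote_data =~ /\A([\w.-]+)\z/
    or die "remote data failed validation\n";

# $safe is now untainted and may be used in open(), system(), etc.
print "validated: $safe\n";
```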

        amt
        "are there any security issues involved in fetching a page in this way?"

        It shouldn't be any more of a security issue than retrieving it with Mozilla, or any other browser. As a matter of fact, I would worry more about Mozilla than Perl.

        You have to learn how your firewall works. There is a difference between opening up a server on a port listening for connections, and using a port to receive from a connection which YOU initiated. It's called an 'established' connection: one which you initiate, which then opens a port as part of that established connection. FTP works this way too. The next time you fetch a file through HTTP with a conventional browser, type "socklist" (as root) and look at the sockets and ports opened up to receive it.


        I'm not really a human, but I play one on earth. flash japh

Node Type: perlquestion [id://392209]
Approved by Arunbear
Front-paged by grinder