PerlMonks  

Fetching HTML Pages with Sockets

by amt (Monk)
on Sep 19, 2004 at 18:15 UTC ( [id://392209] )

amt has asked for the wisdom of the Perl Monks concerning the following question:

Gentlemen,
I am attempting to retrieve a web page from a server using sockets and HTTP commands, due to limitations of my server.
Here is my code segment:
use Socket;
use CGI;

my $page = connect_host();
print "$page";

sub connect_host {
    my $remote = gethostbyname("football.fantasysports.yahoo.com");
    my $proto  = getprotobyname('tcp');
    my $port   = 80;
    my $remote_host = sockaddr_in($port, $remote);
    socket(SOCK, PF_INET, SOCK_STREAM, $proto);
    connect(SOCK, $remote_host);
    print SOCK "GET / HTTP/1.0\r\n\r\n";
    my $html = <SOCK>;
    return $html;
}


When executing the code, I receive the following error:
Use of uninitialized value in string at ./pcs.pl line 16.

Thanks in advance.

amt

Replies are listed 'Best First'.
Re: Fetching HTML Pages with Sockets
by ikegami (Patriarch) on Sep 19, 2004 at 18:47 UTC

    That code works for me without any warnings, although it would hang until I called flush(SOCK) after the print, where flush is defined as
    sub flush { select($_[0]); my $t=$|; $|=1; $|=$t; }.

    All of the functions you called above return undef on error, and set $! to an error number and "$!" to an error message. Why don't you check for errors?

    btw, LWP comes with perl. You probably should be using that. For starters, it knows that "\r\n" will not always do what you want. Use "\xD\xA" instead of "\r\n". Also, you should do local *SOCK to limit the scope of SOCK.
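    Putting those suggestions together, an error-checked version might look something like this (an untested sketch; build_request is a made-up helper, and the host and path are only placeholders):

```perl
use strict;
use warnings;
use Socket;

# Hypothetical helper (not from the original post): build a minimal
# HTTP/1.0 request using explicit CRLF bytes rather than "\r\n".
sub build_request {
    my ($host, $path) = @_;
    return "GET $path HTTP/1.0\015\012"
         . "Host: $host\015\012\015\012";
}

sub fetch_page {
    my ($host, $path) = @_;
    my $remote = gethostbyname($host)
        or die "gethostbyname failed for $host: $!";
    my $proto = getprotobyname('tcp');
    my $paddr = sockaddr_in(80, $remote);

    local *SOCK;    # limit the scope of the bareword handle
    socket(SOCK, PF_INET, SOCK_STREAM, $proto) or die "socket: $!";
    connect(SOCK, $paddr)                      or die "connect: $!";

    # Unbuffer the socket so the request is actually sent.
    select((select(SOCK), $| = 1)[0]);

    print SOCK build_request($host, $path);

    # <SOCK> in scalar context reads a single line, which is why the
    # original returned only one line; slurp the whole response instead.
    my $response = do { local $/; <SOCK> };
    close SOCK;
    return $response;
}

# e.g. print fetch_page('football.fantasysports.yahoo.com', '/');
```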

      Unfortunately, when I tried to use LWP::UserAgent it was not included in @INC, and getting my sysadmin to install anything, well, won't happen.
      When using the hex values, I get barked at for an invalid hex value.
      amt
        Odd, it works for me in perl 5.6.1 and 5.8.0. How about octal? "\015\012"
Re: Fetching HTML Pages with Sockets
by davido (Cardinal) on Sep 20, 2004 at 03:05 UTC

    I just thought I would mention this...

    LWP::Simple is a pure-Perl module, with only one dependency: HTTP::Status. That too is a pure-Perl module, with no dependencies. So both could be installed to a ~/lib path within your own user path, without sysadmin intervention, and without tricky compilation steps. It's pretty easy.

    After installing them, you would only need to adjust @INC so that perl can find where you've put those modules.

    Doing so would allow you to get pages via HTTP as easily as my $page = get('http://www.perlmonks.org');
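    In other words, something like the following (a minimal sketch, assuming the two modules have been copied under ~/lib, e.g. as ~/lib/LWP/Simple.pm):

```perl
use strict;
use warnings;
use lib "$ENV{HOME}/lib";    # prepend your private library path to @INC

# With the modules in place, fetching a page is one call.
# require/eval is used here only so the snippet degrades gracefully
# when the module has not been installed yet.
if (eval { require LWP::Simple; 1 }) {
    my $page = LWP::Simple::get('http://www.perlmonks.org');
    print $page if defined $page;
}
else {
    warn "LWP::Simple not found in \@INC\n";
}
```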


    Dave

      I would also add that Win32 Perl installations have "get.bat" in the /Perl/bin dir, so you can "get" pages from the command line, or you can write simple one-liners to fetch one or many pages or binaries :-)
      perl -le "`get http://www.perlmonks.com/index.pl?parent=392273;node_id=3333 > node_$_.htm` for qw /3333/;"

      This example fetches this node and the one above it.
      JamesNC
Re: Fetching HTML Pages with Sockets
by tachyon (Chancellor) on Sep 19, 2004 at 23:45 UTC

    Your code won't work with many servers, as they require a Host: header.

    sub socket_get {
        my ( $url, $cookies ) = @_;
        return ( "HTTP/1.0 500 Invalid URL $url", '' )
            unless $url =~ m!^http://([^/:\@]+)(?::(\d+))?(/\S*)?$!;
        my $host = $1;
        my $port = $2 || 80;
        my $path = $3;
        $path = "/" unless defined $path;
        require IO::Socket::INET;
        my $sock = IO::Socket::INET->new(
            PeerAddr => $host,
            Proto    => 'tcp',
            PeerPort => $port,
        ) or return ( "HTTP/1.0 500 Could not connect socket: $url", '' );
        $sock->autoflush;
        my $netloc = $host;
        $netloc .= ":$port" if $port != 80;
        my $cookie_str = '';
        if ( $cookies and ref($cookies) eq 'ARRAY' ) {
            $cookie_str .= "Cookie: $_\015\012" for @$cookies;
        }
        my $req = join '',
            "GET $path HTTP/1.0\015\012",
            "Host: $netloc\015\012",
            "Accept: */*\015\012",
            "Accept-Encoding: *\015\012",
            $cookie_str,
            "User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; )\015\012\015\012";
        syswrite( $sock, $req );
        my ( $header, $content, $buffer ) = ( '', '', '' );
        $content .= $buffer while sysread( $sock, $buffer, 8192 );
        $sock->close;
        return ( "HTTP/1.0 500 Failed to get any content for $url", '' )
            unless $content;
        ( $header, $content ) = split /\015\012\015\012|\012\012|\015\015/, $content, 2;
        return ( "HTTP/1.0 500 Failed to get a header for $url", '' )
            unless $header;
        # unfold the header
        $header =~ s/\015\012/\n/g;
        $header =~ s/\n\s+/ /g;
        $content ||= '';
        return ( $header, $content );
    }

    cheers

    tachyon

Re: Fetching HTML Pages with Sockets
by zentara (Archbishop) on Sep 20, 2004 at 13:29 UTC
    Here is one, which is easy to understand. (I didn't write it, but it works fine).
    #!/usr/bin/perl
    # Very simple client program to fetch
    # pages from specified Web sites.
    require 5.002;
    use strict;
    use Socket;

    # Perl 5 technique for declaring local variables.
    my ( $host, $in_addr, $proto, $port, $addr );
    my ( $response, $page, $file, $pattern, %urls );

    # Set up some URLs in an array
    my @pages = ( "zentara.net/~zentara/poems.html", "zentara.net" );

    foreach $page (@pages) {
        ( $host, $file ) = split /\//, $page, 2;

        # Form the HTTP server address from the host
        # name and port number
        $in_addr = ( gethostbyname($host) )[4];
        $port    = 80;
        $addr    = sockaddr_in( $port, $in_addr );
        $proto   = getprotobyname('tcp');

        # Create an Internet protocol socket.
        socket( S, AF_INET, SOCK_STREAM, $proto ) or die "socket:$!";

        # Connect our socket to the server socket.
        connect( S, $addr ) or die "connect:$!";

        # Force a flush on the socket file handle after every write.
        select(S); $| = 1; select(STDOUT);

        # Send get request to server.
        print S "GET /$file HTTP/1.0\n\n";
        print "===================$page===========================\n";

        # Print the returned HTML.
        while (<S>) { print; }
        close(S);
    }
    exit;

    I'm not really a human, but I play one on earth. flash japh
      Thanks for posting that script. I've been experimenting with sockets, lately, but strictly in the realm of our lan. I had to give permission to the firewall to let me through, but once I did this worked nicely. Question: are there any security issues involved in fetching a page in this way? Just want to make sure whether I'm playing with fire, or just scrabbling in the dirt as I usually do.
        I can't think of any security issues that would arise from pulling files down using a socket and HTTP directives, but keep in mind that if the sockets are not set up properly, you may leave ports open. Making sure that you close the sockets explicitly is always a good measure.

        Also be sure to run perl with the Taint option if you plan on using the output from a remote location as the input on your script.
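        For instance, the untainting idiom usually looks like this (a minimal sketch; the $remote_data literal just stands in for bytes read off the socket, which perl would mark as tainted when run with -T):

```perl
# Run as: perl -T script.pl
use strict;
use warnings;

my $remote_data = "index.html";   # stand-in for data read from the socket

# Under taint mode, the only way to untaint data is to extract
# the part you trust with a regex capture.
my ($safe) = $remote_data =~ /\A([\w.-]+)\z/
    or die "remote data failed validation\n";

# $safe is now untainted and may be used in open(), system(), etc.
print "validated: $safe\n";
```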

        amt
        "are there any security issues involved in fetching a page in this way?"

        It shouldn't be any more of a security issue than retrieving it with Mozilla, or any other browser. As a matter of fact, I would worry more about Mozilla than Perl.

        You have to learn how your firewall works. There is a difference between opening up a server on a port listening for connections, and using a port to receive from a connection which YOU initiated. It's called an 'established' connection: one which you initiate, which then opens a port as part of that established connection. FTP works this way too. The next time you fetch a file through HTTP with a conventional browser, type "socklist" (as root) and look at the sockets and ports opened up to receive it.


        I'm not really a human, but I play one on earth. flash japh

Node Type: perlquestion [id://392209]
Approved by Arunbear
Front-paged by grinder