Sary has asked for the wisdom of the Perl Monks concerning the following question:

Hey there. Been writing a web crawler and am stuck on web requests.

Here's the code:

use strict;
use IO::Socket;
use local::lib;

print("Welcome to my webcrawler, \n Please specify a url in the following format : www.xxx.y \n");
chomp(my $yourURL = <STDIN>);

my $sock = new IO::Socket::INET (
    PeerAddr => $yourURL,
    PeerPort => '80',
    Proto    => 'tcp',
);
die "Could not create socket: $!\n" unless $sock;
print("TCP Connection Success. \n");

send($sock, "GET /HTTP/1.1\n\n", 0);
my @response = <$sock>;
print("@response");

The response I always get is 404 Not Found. Can't seem to find my mistake; would someone light my fire? :]

Thanks,

alex.

Re: HTTP request
by Eliya (Vicar) on Mar 13, 2011 at 14:47 UTC
     send($sock,"GET /HTTP/1.1\n\n",0);

    You need a space between the '/' (standing for the resource to get — here top-level / document root) and the protocol (HTTP/1.1):

    send($sock,"GET / HTTP/1.0\n\n",0);
                     ^

    See HTTP Request message.

    ___

    P.S. You should normally use \r\n as newlines, as specified in the HTTP protocol — although \n is typically also understood (most web servers and browsers are rather error-tolerant).
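    For illustration, a minimal sketch of the same request with explicit \r\n terminators (still using send):

    send($sock, "GET / HTTP/1.0\r\n\r\n", 0);   # request line, then blank line, both CRLF-terminated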

    If you use print instead of send, you can also apply the :crlf PerlIO layer:

    binmode $sock, ":crlf";
    print $sock "GET / HTTP/1.0\n\n";

    (the :crlf layer seems to be ignored with send, presumably because send bypasses the PerlIO layers and writes to the socket directly)

      That's it.

      And because the whole /HTTP/1.1 bit looks like the resource, the server is assuming HTTP/1.0 rather than HTTP/1.1, so it isn't issuing a Bad Request response for your lack of a Host header.

      -sauoq
      "My two cents aren't worth a dime.";
        And because the whole /HTTP/1.1 bit looks like the resource, the server is assuming HTTP/1.0 rather than HTTP/1.1

        Not quite. The entire request lacks a protocol specification, so the server will treat it as an HTTP/0.9 request for a resource named /HTTP/1.1. That resource usually does not exist, resulting in a "404 Not Found" response.
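        To illustrate the difference (a sketch; assumes a server that still honors HTTP/0.9 simple requests):

        # HTTP/0.9 "simple request": just GET and a path, no protocol token,
        # no headers; the server replies with the raw body and no status line.
        send($sock, "GET /\r\n", 0);

        # HTTP/1.0 "full request": the protocol token makes the server send
        # a status line and headers before the body.
        send($sock, "GET / HTTP/1.0\r\n\r\n", 0);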

        Of course, servers are free to refuse HTTP/0.9 and even HTTP/1.0 with a "400 Bad Request" response, but I have not yet seen such a server.

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: HTTP request
by marto (Cardinal) on Mar 13, 2011 at 13:45 UTC
      Yes, I'm trying to figure out my own code for now. And it seems I've bitten off more than I can chew.

      Anyway, my efforts at learning some socket programming through Perl have been less than appreciated at my academy, and I am forced to focus on other topics. Though I'm still gonna consult you people for advice undercover :].

        As you've discovered, there is more to a crawler than socket programming. You don't seem to understand the basics or appreciate what is involved in implementing something like this properly: respecting robots.txt (or even learning what robots.txt is), not hammering servers, etc.

        I suggest you take the time to learn the basics, review the advice people have given you on these topics, and see Network Programming in the tutorials section.
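        For what it's worth, here is a minimal sketch of a polite fetch using LWP::RobotUA, which honors robots.txt and rate-limits requests for you (the bot name and contact address below are placeholders):

        use strict;
        use warnings;
        use LWP::RobotUA;

        # Identify the bot; the name and e-mail address here are made up.
        my $ua = LWP::RobotUA->new('my-crawler/0.1', 'me@example.com');
        $ua->delay(1/60);   # wait at least 1 second between requests per host

        my $res = $ua->get('http://www.example.com/');
        print $res->is_success ? $res->decoded_content : $res->status_line, "\n";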

Re: HTTP request
by sauoq (Abbot) on Mar 13, 2011 at 14:38 UTC

    You are reinventing the wheel.

    If you insist on doing that, fine... but you'll have to learn a lot of details you could otherwise ignore.

    One detail you'll have to learn about HTTP 1.1 is that a Host header is required. Not sending it should return a Bad Request rather than a Not Found though, so I'm not sure that's your problem.
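    For example, here is a minimal sketch of a valid HTTP/1.1 request against the original code's socket ($yourURL standing in for the hostname read from STDIN):

    # HTTP/1.1 requires a Host header; "Connection: close" asks the server
    # to close the socket after responding, so reading <$sock> reaches EOF.
    send($sock, "GET / HTTP/1.1\r\n"
              . "Host: $yourURL\r\n"
              . "Connection: close\r\n\r\n", 0);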

    -sauoq
    "My two cents aren't worth a dime.";