Sary has asked for the wisdom of the Perl Monks concerning the following question:

Hey there, dear monks. I've been learning Perl intensively for a week now through books, coding and your guidance. However, my educational program leaves me no time for serious in-depth research on socket programming, so I'm in a hurry to grasp some basics.

Here's the deal: I've been writing a web crawler that is supposed to run through the links on a given website and check their availability. Oh, and I'm also very stuck. Here's the code:

use strict;
use IO::Socket;
use local::lib;

print("Welcome to my webcrawler, \n Please specify a url in the following format : www.xxx.y \n");
chomp(my $yourURL = <STDIN>);

my $sock = new IO::Socket::INET (
    PeerAddr => $yourURL,
    PeerPort => '80',
    Proto    => 'tcp',
);
die "Could not create socket: $!\n" unless $sock;
print("TCP Connection Success. \n");

send($sock, "GET http:" . $yourURL . "HTTP/1.0\n\n", 0);
my @response = <SOCKET>;
print("@response");

and here is the input and response I got in cmd:

C:\Users\User\Desktop>Crawlertst1.pl
Welcome to my webcrawler,
 Please specify a url in the following format: www.xxx.y
www.google.com
TCP Connection Success.

What I expected was to get the URL server's response printed using my @response var. That doesn't happen. Please help^^

Re: My Crawler script
by Corion (Patriarch) on Mar 10, 2011 at 07:50 UTC

    If you want to write a web crawler, look at WWW::Mechanize instead of trying to write the HTTP communication yourself, especially if you are not willing to make time for "serious in depth research on the subject of socket programming".
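
    For illustration, a minimal link checker along those lines might look roughly like this (just a sketch; the start URL, the use of HEAD requests, and the output format are choices made up for this example):

    use strict;
    use warnings;
    use WWW::Mechanize;

    # Hypothetical start URL, for illustration only
    my $start = 'http://www.example.com/';

    my $mech = WWW::Mechanize->new( autocheck => 0 );
    $mech->get($start);
    die "Could not fetch $start: ", $mech->status, "\n" unless $mech->success;

    # Check every link found on the start page
    for my $link ( $mech->links ) {
        my $url = $link->url_abs;
        my $res = $mech->head($url);   # a HEAD request is enough to test availability
        printf "%s %s\n", $res->code, $url;
    }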

Re: My Crawler script
by atcroft (Abbot) on Mar 10, 2011 at 07:49 UTC

    While learning socket programming is useful (and you may wish to look at perlipc for more on that), might I suggest LWP, LWP::Simple, and LWP::UserAgent as a starting point, if all you want to build is a web crawler?
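
    For example, a basic availability check with LWP::UserAgent might look something like this (a sketch only; the URL list and output format are made up for illustration):

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new( timeout => 10 );

    # Hypothetical list of URLs to check, for illustration
    my @urls = ( 'http://www.example.com/', 'http://www.example.org/missing' );

    for my $url (@urls) {
        my $res = $ua->head($url);   # a HEAD request is enough to check availability
        print $res->is_success ? "OK   $url\n" : "FAIL $url (" . $res->status_line . ")\n";
    }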

    Hope that helps.

Re: My Crawler script
by Eliya (Vicar) on Mar 10, 2011 at 08:16 UTC
    What I expected was to get the URL server's response printed

    I agree with the tenor of the other replies, but in case you want to know what the problem with your code is...

    For one, you've connected $sock, but are then trying to read from SOCKET.

    Also, the page to get is specified relative to the server, i.e. without protocol "http://" and server name.  E.g.

    ...
    send($sock, "GET / HTTP/1.0\n\n", 0);
    my @response = <$sock>;
    print("@response");
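
    Putting those two fixes together, the request/read part of your script might end up looking roughly like this (a sketch, not a definitive version; I've also added a Host header and \r\n line endings, which HTTP expects, although many servers are lenient about them):

    use strict;
    use IO::Socket;

    chomp(my $yourURL = <STDIN>);

    my $sock = new IO::Socket::INET (
        PeerAddr => $yourURL,
        PeerPort => '80',
        Proto    => 'tcp',
    );
    die "Could not create socket: $!\n" unless $sock;

    # The path is relative to the server; HTTP header lines end in \r\n,
    # and virtual hosts generally need a Host header.
    send($sock, "GET / HTTP/1.0\r\nHost: $yourURL\r\n\r\n", 0);

    my @response = <$sock>;   # read from the connected handle, not <SOCKET>
    print("@response");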
Re: My Crawler script
by pemungkah (Priest) on Mar 10, 2011 at 08:12 UTC
    Definitely agree - you're going way too low-level and making lots of extra work for yourself.

    Notes:

    1. Make sure you obey robots.txt. libwww-perl will give you the necessary tools for this.
    2. Make sure your crawler is polite and doesn't hammer the site to death, fetching pages as fast as it can. Add a short sleep() - even 1 second - between pages (see the sketch after this list).
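
    As a concrete starting point for both notes, LWP::RobotUA (part of libwww-perl) fetches and honours robots.txt and waits between requests to the same host. A minimal sketch, assuming a made-up bot name, contact address and URL list:

    use strict;
    use warnings;
    use LWP::RobotUA;

    # Hypothetical bot name and contact address, for illustration
    my $ua = LWP::RobotUA->new( 'MyCrawler/0.1', 'me@example.com' );
    $ua->delay( 1/60 );   # delay is in minutes, so this is roughly 1 second between requests

    for my $url ( 'http://www.example.com/', 'http://www.example.com/private/' ) {
        my $res = $ua->get($url);   # robots.txt is fetched and respected automatically
        print $res->code, ' ', $url, "\n";
    }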
      Please explain note #1. What's robots? Gee, who's the genius who voted -1? No one is forced to help me or even read my posts. Direct replies or references for research are both appreciated, but simple and sometimes dumb questions will be asked anyway...
        robots.txt is an agreed-upon standard (see this site for lots of details) for limiting access to websites, specifically for crawlers.

        It defines

        • who is allowed to crawl the site
        • what paths they may or may not crawl at that site
        The robots.txt file is very important, as it keeps you from crawling links that could cause problems at the remote site, either by consuming large amounts of resources (e.g., an "add to shopping cart" link; following all of these on a site could generate a very large shopping cart indeed!) or by causing actual problems (e.g., a "delete" link or "report spam" link).

        Your crawler should read the robots.txt and follow its strictures - including skipping the site altogether if you see

        User-agent: *
        Disallow: /
        or a "disallow" that specifies your particular user agent.

        I should note that some sites are a bit weird about who crawls them; at Blekko we had a certain site that wasn't sure they agreed with us on some philosophical points, to put it kindly, and they specifically blocked our crawler. This could happen, and it's important to be polite and follow the robots.txt directives to prevent people from taking more aggressive action, like blocking your IP (or worse, entire IP block).

        (Edit: updated last sentence to clarify it slightly.)