Sary has asked for the wisdom of the Perl Monks concerning the following question:

Hello, and thank you for your time. Being new to Perl, I might pop off some off-topic or basic questions here; I only hope you guys will be patient :]

I've been toying around with building a basic web crawler to learn, and here's where I'm stuck. First, I would need to access the Socket module.

Now I stumbled across two ways to do so:

use IO::Socket;
use Socket;
What would be the difference?

Next, I'd need to create an actual socket:

socket(SOCKET, PF_INET, SOCK_STREAM, getprotobyname('tcp'))

Now, the function's parameters have me at a loss.

SOCKET - This field contains a pointer to an already existing socket? Huh?

PF_INET - Could also be AF_INET. Either address family or protocol family. What would that mean?

SOCK_STREAM - Going through some existing crawler code, I couldn't even locate where this stream came from. Is it there by default?

getprotobyname('tcp') - Either TCP, UDP or pure datagram socket type? I guess I get this one.

Then, in some places I've read that I would need to run bind(SOCKET, ADDRESS) to assign an IP to the socket (my IP, I guess), but in the example I'm working with this isn't included. Where is the source IP assigned, then?

Sometimes I see a socket created differently in Perl, with use IO::Socket::INET; What would those different types/modules of sockets be used for? A socket created using the ::INET module also accepts a peer address. Why is it implemented differently?

Now, we want to send data through our socket using the send function (is it called a function or a sub?): send(SOCKET, "GET http://google.com HTTP/1.0\n\n", 0). I'm left wondering, what about the TCP handshake? It seems we can just skip it and ask the server for the resource right off the bat. Is that always the case?

And the final question: to receive a response, we would need our server socket running in listen mode with an accept in a while (always) loop. How do I implement code that would let me efficiently store the web page so that I can scan it for potential links to other sites?

Waiting eagerly for your guidance and tips, Alex.

Hey guys! Thanks for your replies. I will definitely take a look at Stevens' network programming book, and as for LWP, I indeed want to understand socket programming before I move on to higher-level programming. So, if anyone here is a socket pro, I would appreciate it if they could go step by step with me here. Thanks!

Re: some SOCKET action
by roboticus (Chancellor) on Mar 07, 2011 at 14:00 UTC

    Sary:

    If you just want a web crawler, I'd suggest you start with LWP or WWW::Mechanize so you can avoid the socket details. But if you want to attack your problem from the socket level to learn, then I'd suggest you get a good reference on socket programming to learn a bit more about the subject. Writing code for something you don't understand can be frustrating. While I'm certain that the University of Google can get you started, I recommend the Stevens book on UNIX network programming; check out the sample chapter referenced here. Even though it's for UNIX, almost everything translates nicely to Windows as well.

    The same author has several other books on networking that may be interesting to you. (Note: I have no relation with the author, save having purchased, read and enjoyed the book I recommended.)

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

    Update: Repaired second CPAN link.

Re: some SOCKET action
by atemon (Chaplain) on Mar 07, 2011 at 13:59 UTC
Re: some SOCKET action
by fidesachates (Monk) on Mar 07, 2011 at 20:57 UTC
    "use IO::Socket;
    use Socket;
    What would be the difference?"
    While I could repeat back to you what the Perl docs say, I imagine you could do that yourself. Instead, I found something here at PerlMonks which may be of use.
    http://www.perlmonks.org/?node_id=104273
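
    As a rough illustration of the difference (a sketch only, not taken from the linked node): Socket supplies the constants and address-packing helpers that go with Perl's built-in socket functions, while IO::Socket::INET wraps the same steps in an object. The throwaway loopback listener and the variable names are my own assumptions, there only so the example runs without touching the outside network.

```perl
use strict;
use warnings;
use Socket qw(PF_INET SOCK_STREAM inet_aton pack_sockaddr_in);
use IO::Socket::INET;

# A throwaway listener on the loopback interface so the example needs no
# outside network. LocalPort => 0 lets the OS pick a free port.
my $server = IO::Socket::INET->new(
    LocalAddr => '127.0.0.1',
    LocalPort => 0,
    Proto     => 'tcp',
    Listen    => 5,
) or die "listen failed: $@";
my $port = $server->sockport;

# Way 1: built-in functions plus constants and helpers from Socket.
socket(my $raw, PF_INET, SOCK_STREAM, getprotobyname('tcp'))
    or die "socket: $!";
connect($raw, pack_sockaddr_in($port, inet_aton('127.0.0.1')))
    or die "connect: $!";

# Way 2: IO::Socket::INET performs socket() and connect() in its constructor.
my $obj = IO::Socket::INET->new(
    PeerAddr => '127.0.0.1',
    PeerPort => $port,
    Proto    => 'tcp',
) or die "connect failed: $@";

print "both styles connected to port $port\n";
```

    Both handles end up as ordinary TCP connections; the object version just saves you the packing and error-prone argument order.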

    SOCKET - This field contains a pointer to an already existing socket? Huh?
    No, you provide the filehandle you want to use to reference the socket that this function will create. It's similar to open.

    PF_INET - Could also be AF_INET. Either Address Family or protocol family. What would that mean?
    OK, so in today's network programming world, there is no difference (says that with his fingers crossed or whatever kids do nowadays to indicate they're lying). Use AF since it's the commonly accepted way of doing things now. Not that it matters. In the socket.h that Socket.pm is based on, they're defined to be the same thing: #define PF_INET AF_INET
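
    If you'd rather check than take my word for it, here is a small sketch that prints both constants as Socket exports them. I'd expect them to match on any common platform, but treat that as an assumption about your system rather than a guarantee.

```perl
use strict;
use warnings;
use Socket qw(AF_INET PF_INET);

# Both names are exported by Socket as plain integer constants.
printf "AF_INET = %d, PF_INET = %d\n", AF_INET, PF_INET;
print "same value\n" if AF_INET == PF_INET;
```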

    SOCK_STREAM - Going through some existing crawler code, I couldn't even locate where this stream came from. Is it there by default?
    This indicates what type of socket you want to create. SOCK_STREAM indicates a stream socket, which together with PF_INET means TCP/IP. This is what you want for a web crawler. The other options can be found here: http://publib.boulder.ibm.com/infocenter/aix/v6r1/index.jsp?topic=/com.ibm.aix.progcomm/doc/progcomc/skt_types.htm
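
    To make the socket type a bit more concrete, here is a minimal sketch that creates one stream socket and one datagram socket and then asks the OS what type each one is. The variable names are mine, and reading SO_TYPE back with unpack 'i' assumes the option is a native int, which is the usual case.

```perl
use strict;
use warnings;
use Socket qw(PF_INET SOCK_STREAM SOCK_DGRAM SOL_SOCKET SO_TYPE);

# Stream socket in the Internet family: TCP.
socket(my $tcp, PF_INET, SOCK_STREAM, getprotobyname('tcp'))
    or die "tcp socket: $!";

# Datagram socket in the same family: UDP.
socket(my $udp, PF_INET, SOCK_DGRAM, getprotobyname('udp'))
    or die "udp socket: $!";

# Ask the OS which type each handle ended up with.
my $tcp_type = unpack 'i', getsockopt($tcp, SOL_SOCKET, SO_TYPE);
my $udp_type = unpack 'i', getsockopt($udp, SOL_SOCKET, SO_TYPE);

print "tcp handle is a stream socket\n"   if $tcp_type == SOCK_STREAM;
print "udp handle is a datagram socket\n" if $udp_type == SOCK_DGRAM;
```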

    Then, in some places I've read that I would need to run bind(SOCKET, ADDRESS) to assign an IP to the socket (my IP, I guess), but in the example I'm working with this isn't included. Where is the source IP assigned, then?
    Bind is only needed if the socket you're using is going to start listening. You need to bind the socket to the address and port that you will be listening on. Otherwise, if the socket is going to be used for connecting, you don't need bind, you need connect. When you bind, you choose the address and port; it doesn't "come" from anywhere. When you connect without binding first, the operating system picks the source address and a free source port for you.
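
    The bind-versus-connect split can be sketched roughly as follows. This is only an illustration on the loopback interface: the variable names and the use of port 0 (meaning "any free port") are my own choices, not something from your example code.

```perl
use strict;
use warnings;
use Socket qw(PF_INET SOCK_STREAM INADDR_LOOPBACK
              pack_sockaddr_in unpack_sockaddr_in inet_ntoa);

my $proto = getprotobyname('tcp');

# Listening side: bind() picks the address and port, listen() opens it up.
socket(my $listener, PF_INET, SOCK_STREAM, $proto) or die "socket: $!";
bind($listener, pack_sockaddr_in(0, INADDR_LOOPBACK))   # port 0 = any free port
    or die "bind: $!";
listen($listener, 5) or die "listen: $!";
my ($listen_port) = unpack_sockaddr_in(getsockname($listener));

# Connecting side: no bind(); connect() leaves the source address to the OS.
socket(my $client, PF_INET, SOCK_STREAM, $proto) or die "socket: $!";
connect($client, pack_sockaddr_in($listen_port, INADDR_LOOPBACK))
    or die "connect: $!";
my ($src_port, $src_ip) = unpack_sockaddr_in(getsockname($client));

printf "listening on port %d, client source is %s:%d\n",
    $listen_port, inet_ntoa($src_ip), $src_port;
```

    The source port printed for the client is one the program never asked for, which is the answer to "where is the source IP assigned".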

    I'm left wondering, what about the TCP handshake? It seems we can just skip it and ask the server for the resource right off the bat. Is that always the case?
    You're under the impression that sockets operate on the transport level. That is incorrect (to a certain degree). They operate on the level above that, the application level. I say this because when you use send, the data you send is all contained in the payload portion of the packets. You don't specify any of the TCP headers. Thus, as a programmer using sockets, you're operating on the application level. Once you have declared the socket to be a TCP/IP socket, the connect statement takes care of the initial handshake and syncs. If you want to verify this for yourself, run Wireshark and view the handshake.
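
    Here is a small sketch of that idea, using IO::Socket::INET for brevity. The tiny loopback "server" and its canned response are stand-ins I made up so that the example runs on its own; a real crawler would connect to a real web server instead. The point is that the program never writes a SYN or an ACK anywhere: the handshake is finished by the time the connecting constructor returns, and everything after that is payload.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Stand-in for a web server, on loopback so nothing leaves the machine.
my $listener = IO::Socket::INET->new(
    LocalAddr => '127.0.0.1', LocalPort => 0, Proto => 'tcp', Listen => 1,
) or die "listen failed: $@";

# The TCP handshake happens inside this constructor's connect() call.
my $client = IO::Socket::INET->new(
    PeerAddr => '127.0.0.1', PeerPort => $listener->sockport, Proto => 'tcp',
) or die "connect failed: $@";
my $server_side = $listener->accept or die "accept failed: $!";

# From here on the program only deals with payload bytes.
my $request = "GET / HTTP/1.0\r\n\r\n";
$client->syswrite($request) or die "write: $!";
$server_side->sysread(my $seen, 1024) // die "read: $!";

# Canned response from the stand-in server, then close to signal the end.
$server_side->syswrite("HTTP/1.0 200 OK\r\n\r\nhello") or die "write: $!";
$server_side->close;

# Read until the server closes the connection.
my $response = '';
while ($client->sysread(my $chunk, 1024)) { $response .= $chunk }
print "server saw: $seen", "client got: $response\n";
```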

    As to your final question regarding storage of web pages, I think I'll leave that question to another fellow monk since my area of expertise is networks. Hope this information helped.
      It did a lot :], thanks. I'll get to all that reading advised earlier.