"use IO::Socket;
use Socket;
What would be the difference?"
While I could repeat back to you what the Perl docs say, I imagine you could look that up yourself. Instead, here is something from PerlMonks that may be of use:
http://www.perlmonks.org/?node_id=104273
SOCKET - This field contains a pointer to an already existing socket? Huh?
No, you provide the filehandle you want to use to refer to the socket that this function will create. It's similar to open.
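A minimal sketch of that idea, assuming the core Socket module and a lexical handle named $sock (the name is just for illustration):

```perl
use strict;
use warnings;
use Socket qw(AF_INET SOCK_STREAM IPPROTO_TCP);

# socket() fills in the handle you pass, much as open() does for a file.
socket(my $sock, AF_INET, SOCK_STREAM, IPPROTO_TCP)
    or die "socket: $!";

# $sock is now a filehandle like any other, with its own descriptor number.
my $fd = fileno($sock);
print "socket descriptor: $fd\n";

close($sock);
```

Nothing is connected yet at this point; the handle only names a socket that connect or bind can act on later.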
PF_INET - Could also be AF_INET. Either Address Family or protocol family. What would that mean?
OK, so in today's network programming world there is no practical difference (he says with his fingers crossed, or whatever kids do nowadays to indicate they're lying). Use AF_INET, since that's the commonly accepted way of doing things now. Not that it matters: in the socket.h that Socket.pm is based on, they're defined to be the same thing:
#define PF_INET AF_INET
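If you'd rather check this from Perl than read the C header, something like the following should do. It only reports what the Socket module exposes on the machine you run it on; I'd expect the two values to match on common platforms, but treat that as an assumption until you've run it.

```perl
use strict;
use warnings;
use Socket qw(AF_INET PF_INET);

# Compare the two constants as this platform's Socket module defines them.
my $same = (AF_INET == PF_INET) ? 1 : 0;
printf "AF_INET=%d PF_INET=%d same=%d\n", AF_INET, PF_INET, $same;
```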
SOCK_STREAM - Going through some existing crawler code, I couldn't even locate where this stream came from. Is it there by default?
This indicates what type of socket you want to create. SOCK_STREAM asks for a reliable, connection-oriented byte stream, which in the AF_INET family means TCP. This is what you want for a web crawler. The other socket types are described here:
http://publib.boulder.ibm.com/infocenter/aix/v6r1/index.jsp?topic=/com.ibm.aix.progcomm/doc/progcomc/skt_types.htm
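To make the type argument concrete, here is a small sketch that creates one stream socket and one datagram socket and then asks the system which type each one is. The handle names $tcp and $udp are mine, and passing 0 as the protocol simply means "use the default for this type".

```perl
use strict;
use warnings;
use Socket qw(AF_INET SOCK_STREAM SOCK_DGRAM SOL_SOCKET SO_TYPE);

# Protocol 0 lets the system pick the default for the requested type
# (TCP for SOCK_STREAM, UDP for SOCK_DGRAM in the AF_INET family).
socket(my $tcp, AF_INET, SOCK_STREAM, 0) or die "socket(stream): $!";
socket(my $udp, AF_INET, SOCK_DGRAM,  0) or die "socket(dgram): $!";

# SO_TYPE reports the type the socket was created with, as a packed integer.
my $tcp_type = unpack('i', getsockopt($tcp, SOL_SOCKET, SO_TYPE));
my $udp_type = unpack('i', getsockopt($udp, SOL_SOCKET, SO_TYPE));

print "stream socket ok\n"   if $tcp_type == SOCK_STREAM;
print "datagram socket ok\n" if $udp_type == SOCK_DGRAM;
```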
Then, in some places I've read that I would need to run bind(SOCKET,ADDRESS) to assign an IP to the socket (my IP, I guess), but in the example I'm working with this isn't included. Where is the source IP assigned, then?
Bind is mainly needed when the socket is going to listen: you bind it to the address and port that you will be listening on. If the socket is only going to connect, you don't need bind, you need connect, and the operating system picks the source IP and an unused source port for you at connect time. When you do bind, you choose the address and port yourself; it doesn't "come" from anywhere.
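Here is a rough sketch of both sides using IO::Socket::INET, run entirely over the loopback interface so it doesn't need a real server. The address 127.0.0.1 and the use of port 0 ("give me any free port") are my choices for the example, not anything from your crawler code.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Listening side: we choose the local address and port ourselves.
# Port 0 asks the system for any free port.
my $server = IO::Socket::INET->new(
    LocalAddr => '127.0.0.1',
    LocalPort => 0,
    Proto     => 'tcp',
    Listen    => 1,
) or die "listen: $@";
my $port = $server->sockport;

# Connecting side: no bind at all, only the peer's address and port.
my $client = IO::Socket::INET->new(
    PeerAddr => '127.0.0.1',
    PeerPort => $port,
    Proto    => 'tcp',
) or die "connect: $@";

# The source address and port were filled in by the system during connect.
my $src_host = $client->sockhost;
my $src_port = $client->sockport;
print "client source: $src_host:$src_port\n";
```

This also shows the practical side of the IO::Socket versus Socket question: IO::Socket::INET wraps the socket, bind, listen and connect calls in one constructor.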
I'm left wondering, what about the TCP handshake? It seems we can just skip it and ask the server for the resource right off the bat. Is that always the case?
You're under the impression that sockets operate at the transport level. That is incorrect (to a certain degree). As a programmer you work at the level above that, the application level. I say this because when you use send, the data you send is all carried in the payload portion of the packets; you don't specify any of the TCP headers. When you declared this socket to be a TCP socket, the connect call takes care of the initial handshake (SYN, SYN-ACK, ACK) for you. If you want to verify this for yourself, run Wireshark and watch the handshake.
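A small loopback sketch of the same point, using the plain Socket functions. The throwaway listener, the handle names and the request string are all made up for the example. Notice that the only thing handed to send is the payload; there is nowhere to put a TCP header even if you wanted to.

```perl
use strict;
use warnings;
use Socket qw(AF_INET SOCK_STREAM INADDR_LOOPBACK
              pack_sockaddr_in unpack_sockaddr_in);

# A throwaway listener on the loopback interface, on any free port.
socket(my $listener, AF_INET, SOCK_STREAM, 0) or die "socket: $!";
bind($listener, pack_sockaddr_in(0, INADDR_LOOPBACK)) or die "bind: $!";
listen($listener, 1) or die "listen: $!";
my ($port) = unpack_sockaddr_in(getsockname($listener));

# connect() returns only after the system has done the TCP handshake for us.
socket(my $client, AF_INET, SOCK_STREAM, 0) or die "socket: $!";
connect($client, pack_sockaddr_in($port, INADDR_LOOPBACK))
    or die "connect: $!";

# send() takes payload bytes only; the headers are built by the system.
my $payload = "GET / HTTP/1.0\r\n\r\n";
defined send($client, $payload, 0) or die "send: $!";

# The listener's side of the connection sees exactly those payload bytes.
accept(my $conn, $listener) or die "accept: $!";
my $received = '';
defined recv($conn, $received, 1024, 0) or die "recv: $!";
print "payload arrived intact\n" if $received eq $payload;
```

Run it with Wireshark capturing on the loopback interface and you should see the three handshake packets go by before the GET line does.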
As to your final question regarding storage of web pages, I think I'll leave that question to another fellow monk since my area of expertise is networks. Hope this information helped.