=head1 GETTING A HANDLE ON IO

=head2 Introduction

This tutorial delves into the naughtier, uglier parts of POSIX-centric I/O. Herein are covered the nasty details of the various calls and the layers they belong to, the different modes of input and output, and the combination of these aspects into practical examples.

The tutorial will not attempt to explain the various ways to obtain the actual handles, except as necessary for specific examples. Its aim is to teach the various styles of I/O, that is, how to manipulate given handles in more specialized ways, in the hope that once you know what kind of interaction with a handle you want, finding out how to get such a handle will be easy using the reference (L, L, L, L, L, L).

The main focus of the tutorial is simplicity. After that comes robustness, and then performance. What this means is that I will not systematically append error checking to every line of example code, because I find that distracting. I will also not resort to ugly constructs to gain a little throughput. I think impure examples hinder my ability to convey my ideas clearly. Nuff said, on to the intro. We start with tiny baby steps, and then start striding forward.

=head2 What is a filehandle?

We'll start by covering the Perl-specific data type that abstracts a stream of data: the filehandle. If you already think you know what you're doing, skip onwards a bit; this is really basic stuff.

Perl's filehandles are points through which data is moved. You can refer to them by name, or by storing them in a variable. The abstraction centers around a metaphor of a sort of porthole, or pipe end, which your software can ask the OS to take data from and move elsewhere, or to put data on for your software to read. Data is moved through these orifices in chunks, coming out of or going into a normal variable, as a string.
For example, let's say we've opened a file:

    open my $fh, "<", "/some/file";

This stores a reference to a filehandle in the variable C<$fh>, which will grant you access to the data inside the file. Perl allows us to ask for data to come out of filehandles in useful ways. Let's say we wanted a single line from the file to be stored in a variable:

    my $var = <$fh>;

But wait, how do we know which line will come out of C<$fh>? Well, the answer is "the next one". Filehandles are stream oriented. Data arrives serially, and you can nibble at it, slowly progressing through the stream of data until it ends. Specifically, handles having to do with files have an implicit cursor, working behind the scenes, marking the point in the file that the handle is currently at.

=head2 Plumbing your handles

To move data in and out of filehandles you use system calls. We'll start with the two most basic calls there are, the read and write system calls, which are available in Perl as the builtin functions C<sysread> and C<syswrite>. Their interfaces are pretty straightforward. Here is a subset of their functionality:

    sysread $fh, $variable_data_will_be_read_to, $how_much_data_to_read;

C<sysread> takes a filehandle as its first argument, a variable as its second, and a number as its third, and reads at most as many bytes as the number specifies, from the handle, into the variable.

    syswrite $fh, $data_to_write;

C<syswrite> takes a filehandle as its first argument and a string as its second argument, and writes the data from the string to the filehandle.

We already know one way that data can be put on a filehandle for us, which was telling the OS what file we'd like it to come from. Writing is just as flexible. The next section discusses ways of telling the OS not only what data is moved around, but where it will go.

=head2 Directing data, a conceptual introduction

Now that we have a hopefully firm grasp of how data enters and exits your software through handles, let's discuss its movement; specifically, where it goes.
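Before doing that, the plumbing calls from the previous section can be tied together into a tiny copy program. This is a minimal sketch; the file names under C</tmp> are invented for the demo, and it creates its own input file so it can be run as-is:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Set up a small input file, just so the demo is self-contained.
open my $setup, ">", "/tmp/io_demo_src" or die "open: $!";
print {$setup} "line of example data\n" x 4;
close $setup;

# Copy one handle to another using only the plumbing calls.
open my $in,  "<", "/tmp/io_demo_src"  or die "open: $!";
open my $out, ">", "/tmp/io_demo_copy" or die "open: $!";

while (1) {
    my $read = sysread $in, my $buf, 4096;   # up to 4096 bytes per call
    defined $read or die "sysread: $!";
    last if $read == 0;                      # 0 bytes means end of stream

    # syswrite may write fewer bytes than asked for, so keep calling
    # it until the whole chunk is out.
    my $written = 0;
    while ($written < $read) {
        my $w = syswrite $out, $buf, $read - $written, $written;
        defined $w or die "syswrite: $!";
        $written += $w;
    }
}

close $out;
```

Note the inner loop: both calls report how many bytes they actually moved, and nothing guarantees that is as many as you asked for.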
The most common use for filehandles is storing and retrieving data in files. We've already seen opening for reading. We can also write to a file:

    open my $fh, ">", "/some/file";

The C<< > >> argument tells C<open> that we want to write to the file (and also to erase its contents first). When C<$fh> is opened for writing, we simply write to it with the calls described above.

But handles are not limited to just files. They can also be sockets, allowing the transfer of data between two unrelated processes, possibly on two different machines. A web server, for example, reads and writes on handles, receiving data from and sending data to browsers. Handles can also serve as pipes to other processes, such as child processes, or processes in a shell pipeline. The latter case is interesting, because it is set up implicitly:

    cat file | tr a-z A-Z > file.uppercase

That command asks C<cat> to read the file C<file> and print it to its I<standard output>. The standard output is the handle that a program normally outputs data to. What "normally" means in this context will be explained soon. Then C<tr> reads data from its I<standard input>, converts the data, and writes it to its own standard output, a chunk at a time. The shell redirect is perhaps the most interesting part: instead of C<tr>'s STDOUT being connected to the terminal, where the user can read the data, the shell connected C<tr>'s STDOUT to a handle of its own, opened to C<file.uppercase>.

I hope this example fulfilled its purpose in demonstrating the flexibility of the concept of piping data around through filehandles.

=head2 The going gets tough

Now that we've covered the conceptual basics, let's look in greater detail at the simplest type of handle there is: a single-purpose, non-seekable, blocking handle.

Single-purpose means that the handle can either read or write, not both. Seekable means that you can use C<seek> to change the cursor position for the file the handle abstracts. Not all handles abstract files, and thus not all handles have cursors.
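For a handle that does abstract a file, the cursor can be moved explicitly with C<seek>. Here is a minimal self-contained sketch; the file name and its contents are invented for the demo:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);   # symbolic names for seek's WHENCE argument

# Write a small file to play with (setup for the demo).
open my $setup, ">", "/tmp/seek_demo" or die "open: $!";
print {$setup} "0123456789";
close $setup;

open my $fh, "<", "/tmp/seek_demo" or die "open: $!";

seek $fh, 5, SEEK_SET;    # move the cursor to byte 5
read $fh, my $tail, 5;    # reads "56789", the five bytes after the cursor

seek $fh, 0, SEEK_SET;    # rewind; the next read starts from the top
read $fh, my $head, 5;    # reads "01234"
```

On a handle with no cursor, such as a pipe, C<seek> simply fails: there is no position to move.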
The ones that don't are simpler to work with. "Blocking" refers to the semantics of the system calls made on the handle.

Non-seekable handles are implemented in terms of a buffer. The operating system associates some scratch space with the handle. As data comes into the buffer from somewhere (it could be your software writing to it, or somebody else if you're on the reading side), it accumulates in that buffer. When data is read from the handle, it is taken from the buffer.

What happens when there is not enough space in the buffer to write any more? Or not enough data in the buffer to be read? This is where the blocking semantics of this kind of handle come in. I'm oversimplifying, but basically, if the writing side wants to write a chunk of data that is too big for the space left in the buffer, the operating system simply makes the write wait until the reading side asks for some data to come out. As data exits the buffer, more space is cleared out, and the writing can continue. Eventually all the data will have been written to the buffer, and the write system call that the writing side executed will return. The same goes for reading: the read system call will simply wait until the data that was asked for has been made available.

The state in which an operating system puts a process that is waiting for an I/O call to complete is referred to as "blocked". When a process is blocked, it leaves the hardware resources free for other processes to use.

Blocking I/O has an interesting property, in that it balances resource allocation in a pipeline. Let's say, for example, that you ran this line of shell:

    cat file.gz | gzip -d | tr a-z A-Z

C<cat> is doing very little work. It's a simple loop: it reads from the file and writes to STDOUT. The data that C<gzip> is getting, on the other hand, is processed more extensively. C<gzip> performs a complex calculation on the data that enters it, and outputs derived data after this calculation.
Then, finally, C<tr> performs simple actions that, while more complex than C<cat>'s, are dwarfed by C<gzip>'s. So what happens is that C<cat> will read some data and write some data, then read some more and write some more, until the buffer is full and its write blocks. All this time, C<gzip>'s and C<tr>'s read calls were blocking. Eventually C<gzip>'s read will return, allowing it to do its job and finally emit data to C<tr>. It turns out that most of the time C<gzip> will be using up CPU time, while C<cat> will spend most of its time blocking in write calls, and C<tr> will spend most of its time blocking in read calls, though it needs some time for its own calculation too; otherwise C<gzip>'s writes would block.

Plan (not really in order):

=over 4

=item *

Blocking, non-seekable handles and their conventions: fatal errors, SIGPIPE, etc. Promote fault-tolerant behavior by default. The UNIX pipelining mantra.

=item *

Explain when blocking is not good, and continue with a single-purpose, non-seekable handle as used in a select loop to avoid it. Mention epoll/kqueue and the Perl interfaces to them. Mention Event/POE as more powerful multiplexing solutions.

=item *

Multiplexing with a threading approach, as an alternative to select; and a non-blocking approach, including SIGIO; non-blocking vs. select; reliability and latency versus blocking and selected I/O. When not to use non-blocking.

=item *

Buffering: stdio vs. syscalls, the different functions, PerlIO.

=item *

Touch on seekable handles briefly, and explain the semantics of blocking and so on as far as file I/O is concerned. Mention files, and discuss that not all things in the filesystem are files: devices (char and block), named pipes, UNIX domain sockets...

=item *

Sockets. Introduce non-stream handles, and discuss the implementations of socket I/O, its multilayered nature, and the relationship between streams and datagrams. Implications of networking environments.

=item *

Discuss I/O on shared handles. Discuss accept() on shared sockets in a preforked environment.

=item *

Appendix: faux I/O: C<open> on a reference, PerlIO layers, and ties.

=back
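To see a blocking write happen in practice, here is a self-contained sketch using a pipe between a parent and a child process. It assumes a POSIX-like system; the chunk count and the child's sleep time are arbitrary choices for the demo:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The parent writes into a pipe; the child sleeps before draining it.
# Once the kernel's pipe buffer fills, the parent's syswrite blocks
# until the child starts reading.
pipe my $reader, my $writer or die "pipe: $!";

my $pid = fork;
defined $pid or die "fork: $!";

if ($pid == 0) {                   # child: the reading side
    close $writer;
    sleep 2;                       # let the parent fill the buffer
    my $buf;
    1 while sysread $reader, $buf, 65536;
    exit 0;
}

close $reader;                     # parent: the writing side
my $start = time;
my $chunk = "x" x 65536;
syswrite $writer, $chunk for 1 .. 32;   # 2MB, far more than a pipe buffer holds
my $elapsed = time - $start;
close $writer;
waitpid $pid, 0;

print "writing blocked for about ${elapsed}s\n";
```

While the child sleeps, the parent's first few writes fill the kernel's pipe buffer and the next one blocks; only when the child wakes and starts reading does the rest of the data flow, so the parent's write loop takes roughly as long as the child's sleep.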