in reply to Re^3: Parrot, threads & fears for the future.
in thread Parrot, threads & fears for the future.
Unfortunately, the cost of using iThreads shared memory, required for the read and write buffers, is so high that using iThreads to do overlapped IO is impractical:
    cmpthese -1, {
        shared    => q[ my $x : shared; ++$x for 1 .. 1e6 ],
        nonshared => q[ my $x;          ++$x for 1 .. 1e6 ],
    };;
    (warning: too few iterations for a reliable count)
                s/iter    shared nonshared
    shared        1.31        --      -89%
    nonshared    0.141      834%        --
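For anyone who wants to reproduce that, here is the same comparison as a standalone script (a sketch; the absolute numbers will vary with Perl build, platform and load):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use threads;
    use threads::shared;
    use Benchmark qw( cmpthese );

    cmpthese( -1, {
        shared    => sub { my $x : shared; ++$x for 1 .. 1e6 },
        nonshared => sub { my $x;          ++$x for 1 .. 1e6 },
    } );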
There are other problems too. Whilst thread == interpreter, each read and write means giving up that thread's timeslice, and a task switch, before the transform thread can do any work. But with interpreter == a kernel thread, when that task switch occurs there is no guarantee (in fact, a very low probability) that the transform thread will get the next timeslice, because the round-robin runs over all kernel threads in the system: those of this process and of every other. The upshot is that it takes at least 3 task switches to read and transform a record, and at least 3 more to write one.
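To make that concrete, here is a minimal iThreads sketch of the read/transform/write hand-off. The record handling is hypothetical and Thread::Queue stands in for the shared buffers and locking; the point is that every enqueue/dequeue pair crosses a kernel-thread boundary, so the scheduler gets a say at every step:

    use strict;
    use warnings;
    use threads;
    use Thread::Queue;

    my $in  = Thread::Queue->new;   # read thread      -> transform thread
    my $out = Thread::Queue->new;   # transform thread -> write thread

    my $reader = threads->create( sub {
        while ( defined( my $rec = <STDIN> ) ) {  # blocks in IO wait
            $in->enqueue( $rec );   # lock, copy into shared memory, signal
        }
        $in->enqueue( undef );      # end-of-stream marker
    } );

    my $xformer = threads->create( sub {
        while ( defined( my $rec = $in->dequeue ) ) {
            $out->enqueue( uc $rec );   # stand-in "transform"
        }
        $out->enqueue( undef );
    } );

    my $writer = threads->create( sub {
        while ( defined( my $rec = $out->dequeue ) ) {
            print $rec;             # blocks in IO wait again
        }
    } );

    $_->join for $reader, $xformer, $writer;

Each dequeue wakes in a different kernel thread from the enqueue that fed it, which is where the task switches in the time-line below come from.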
The idealised situation would be that as soon as the transform thread has got hold of the last record read, the read thread would issue the read for the next one--going straight into the IO wait--and the transform thread would be able to continue within the same timeslice. You cannot arrange for that to happen using kernel threads; at least, not on a single-CPU machine, where it would be of most benefit.
If thread != interpreter--i.e. if more than one thread could run within a single interpreter--then you could use cooperative (user-space, user-dispatched) threads (fibres in Win32 terms; unbound threads in Solaris terms) to achieve this.
I've truncated the write thread's participation, but it is essentially a mirror image of the read thread's. So, with 3 cooperatively dispatched user threads running in the same kernel thread, the process is able to fully utilise every timeslice the OS allocates to it.
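This model isn't purely hypothetical in Perl: the Coro module on CPAN provides exactly these cooperative, user-dispatched threads within a single interpreter (so no iThreads cloning or shared-variable overhead either). A sketch of the same pipeline, assuming Coro and Coro::Channel--noting that for the reads to genuinely overlap you would also need coro-aware, non-blocking handles such as Coro::Handle provides, since a plain blocking read suspends the whole kernel thread:

    use strict;
    use warnings;
    use Coro;              # cooperative threads: one interpreter, one kernel thread
    use Coro::Channel;

    my $in  = Coro::Channel->new( 1 );   # capacity 1 => hand-off semantics
    my $out = Coro::Channel->new( 1 );

    async {                              # read "thread"
        # NB: blocking <STDIN> stalls the whole kernel thread; wrap the
        # handle with Coro::Handle (or similar) for true overlap.
        while ( defined( my $rec = <STDIN> ) ) {
            $in->put( $rec );            # cedes to the transform coro: no kernel switch
        }
        $in->put( undef );               # end-of-stream marker
    };

    async {                              # transform "thread"
        while ( defined( my $rec = $in->get ) ) {
            $out->put( uc $rec );        # stand-in "transform"
        }
        $out->put( undef );
    };

    my $writer = async {                 # write "thread"
        while ( defined( my $rec = $out->get ) ) {
            print $rec;
        }
    };

    $writer->join;                       # drive the scheduler until the pipeline drains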
Using 3 kernel threads, 2 out of every 3 timeslices allocated to the process have to be given up almost immediately due to IO waits. The time-line for each read-transform-write cycle (simplistically) looks something like:
    read            | xform           | write
    thread          | thread          | thread
    ----------------|-----------------|-----------------
    Issue read      | wait lock(in)   | wait lock(out)
    IO wait         |                 |
                    |                 |
    ----------------------------------------------------
                    |                 |
    ~ ~ some unknown number of kernel task switches ~ ~
                    |                 |
    ----------------------------------------------------
    Read completes  | "               | wait lock(out)
    signal record   |                 |
    issue next read |                 |
    IO wait         | wait lock(in)   |
    ----------------------------------------------------
                    |                 |
    ~ ~ some unknown number of kernel task switches ~ ~
                    |                 |
    ----------------------------------------------------
    IO wait         | obtain lock(in) | wait lock(out)
                    | do stuff        |
                    | do stuff        |
                    | wait lock(out)  |
                    | signal write    |
                    | loop            |
    ----------------------------------------------------
                    |                 |
    ~ ~ some unknown number of kernel task switches ~ ~
                    |                 |
    ----------------------------------------------------
                    | wait lock(in)   | obtain lock(out)
                    |                 | write out
                    |                 | IO wait
                    |                 |
Even better than the AIO/fibres mechanism above is overlapped IO combined with asynchronous procedure calls (APCs), but that is "too Redmond" for serious consideration here.
Re^5: Parrot, threads & fears for the future.
by sandfly (Beadle) on Nov 01, 2006 at 12:07 UTC