in reply to Re^3: Parrot, threads & fears for the future.
in thread Parrot, threads & fears for the future.
Unfortunately, the cost of using iThreads shared memory, required for the read and write buffers, is so high that using iThreads to do overlapped IO is impractical:
    cmpthese -1, {
        shared    => q[ my $x : shared; ++$x for 1 .. 1e6 ],
        nonshared => q[ my $x;          ++$x for 1 .. 1e6 ],
    };;
    (warning: too few iterations for a reliable count)
                s/iter    shared nonshared
    shared        1.31        --      -89%
    nonshared    0.141      834%        --
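For anyone who wants to reproduce that, here is the same comparison as a standalone script (a sketch; the absolute numbers will vary with Perl build, platform and load):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use threads;
    use threads::shared;
    use Benchmark qw( cmpthese );

    cmpthese( -1, {
        shared    => sub { my $x : shared; ++$x for 1 .. 1e6 },
        nonshared => sub { my $x;          ++$x for 1 .. 1e6 },
    } );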
There are other problems too. Whilst thread == interpreter, each read and write means giving up that thread's timeslice, and a task switch, before the transform thread can do any work. But with interpreter == a kernel thread, when that task switch occurs there is no guarantee (in fact, a very low probability) that the transform thread will get the next timeslice, because the round-robin runs over all kernel threads in the system: those of this process and of every other. The upshot is that it takes at least 3 task switches to read and transform a record, and at least 3 more to write one.
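To make that concrete, here is a minimal iThreads sketch of the read/transform/write hand-off. The record handling is hypothetical and Thread::Queue stands in for the shared buffers and locking; the point is that every enqueue/dequeue pair crosses a kernel-thread boundary, so the scheduler gets a say at every step:

    use strict;
    use warnings;
    use threads;
    use Thread::Queue;

    my $in  = Thread::Queue->new;   # read thread      -> transform thread
    my $out = Thread::Queue->new;   # transform thread -> write thread

    my $reader = threads->create( sub {
        while ( defined( my $rec = <STDIN> ) ) {  # blocks in IO wait
            $in->enqueue( $rec );   # lock, copy into shared memory, signal
        }
        $in->enqueue( undef );      # end-of-stream marker
    } );

    my $xformer = threads->create( sub {
        while ( defined( my $rec = $in->dequeue ) ) {
            $out->enqueue( uc $rec );   # stand-in "transform"
        }
        $out->enqueue( undef );
    } );

    my $writer = threads->create( sub {
        while ( defined( my $rec = $out->dequeue ) ) {
            print $rec;             # blocks in IO wait again
        }
    } );

    $_->join for $reader, $xformer, $writer;

Each dequeue wakes in a different kernel thread from the enqueue that fed it, which is where the task switches in the time-line below come from.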
The idealised situation would be that as soon as the transform thread has got hold of the last record read, the read thread would issue the read for the next one--going straight into the IO wait--and the transform thread would be able to continue within the same timeslice. You cannot arrange for that to happen using kernel threads; at least, not on a single-CPU machine, where it would be of most benefit.
If thread != interpreter--i.e. if more than one thread could run within a single interpreter--then you could use cooperative (user-space, user-dispatched) threads (fibres in Win32 terms; unbound threads in Solaris terms) to achieve this.
I've truncated the write thread's participation, but it is essentially a mirror image of the read thread's. So, with 3 cooperatively dispatched user threads running in the same kernel thread, the process is able to fully utilise every timeslice the OS allocates to it.
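This model isn't purely hypothetical in Perl: the Coro module on CPAN provides exactly these cooperative, user-dispatched threads within a single interpreter (so no iThreads cloning or shared-variable overhead either). A sketch of the same pipeline, assuming Coro and Coro::Channel--noting that for the reads to genuinely overlap you would also need coro-aware, non-blocking handles such as Coro::Handle provides, since a plain blocking read suspends the whole kernel thread:

    use strict;
    use warnings;
    use Coro;              # cooperative threads: one interpreter, one kernel thread
    use Coro::Channel;

    my $in  = Coro::Channel->new( 1 );   # capacity 1 => hand-off semantics
    my $out = Coro::Channel->new( 1 );

    async {                              # read "thread"
        # NB: blocking <STDIN> stalls the whole kernel thread; wrap the
        # handle with Coro::Handle (or similar) for true overlap.
        while ( defined( my $rec = <STDIN> ) ) {
            $in->put( $rec );            # cedes to the transform coro: no kernel switch
        }
        $in->put( undef );               # end-of-stream marker
    };

    async {                              # transform "thread"
        while ( defined( my $rec = $in->get ) ) {
            $out->put( uc $rec );        # stand-in "transform"
        }
        $out->put( undef );
    };

    my $writer = async {                 # write "thread"
        while ( defined( my $rec = $out->get ) ) {
            print $rec;
        }
    };

    $writer->join;                       # drive the scheduler until the pipeline drains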
Using 3 kernel threads, 2 out of every 3 timeslices allocated to the process have to be given up almost immediately due to IO waits. The time-line for each read-transform-write cycle (simplistically) looks something like:
    read            | xform           | write
    thread          | thread          | thread
    ----------------|-----------------|-----------------
    Issue read      | wait lock(in)   | wait lock(out)
    IO wait         |                 |
                    |                 |
    ----------------------------------------------------
                    |                 |
    ~ ~ some unknown number of kernel task switches ~ ~
                    |                 |
    ----------------------------------------------------
    Read completes  | "               | wait lock(out)
    signal record   |                 |
    issue next read |                 |
    IO wait         | wait lock(in)   |
    ----------------------------------------------------
                    |                 |
    ~ ~ some unknown number of kernel task switches ~ ~
                    |                 |
    ----------------------------------------------------
    IO wait         | obtain lock(in) | wait lock(out)
                    | do stuff        |
                    | do stuff        |
                    | wait lock(out)  |
                    | signal write    |
                    | loop            |
    ----------------------------------------------------
                    |                 |
    ~ ~ some unknown number of kernel task switches ~ ~
                    |                 |
    ----------------------------------------------------
                    | wait lock(in)   | obtain lock(out)
                    |                 | write out
                    |                 | IO wait
                    |                 |
Even better than the AIO/fibres mechanism above is overlapped IO combined with asynchronous procedure calls (APCs), but that is "too Redmond" for serious consideration here.
Re^5: Parrot, threads & fears for the future.
by sandfly (Beadle) on Nov 01, 2006 at 12:07 UTC