comment on

The following was started 2 days ago, but I've been unable to finish it. It was suggested to me I post it as-is anyway, and if there is sufficient interest, I'll try to complete it later.

My first response to almost every question you've asked, is: It depends.

It mostly depends upon the requirements of the algorithm and/or data that needs processing. There are essentially 3 basic types of parallelism (with many variations):

IO parallelism:
The simplest of them all. Mostly IO-bound.
Typified by webservers and the like. Each execution context is essentially independent of all others.
Each execution context spends most of its time waiting (a device or communications partner) for something to do. And when it does get something to do, it does it relatively quickly and then goes back to waiting.
Task parallelism:
Potentially the most complex. A mixture of IO and cpu.
Typified by the Google Map-Reduce architecture. (Though that is (currently) applied at the macro (box) level).
Each execution context runs a different algorithm on independent data, but they need to coordinate and synchronise between each other. Here the algorithms (tasks) are overlapped by passing the stream of data from one stage to the next in smallish chunks. Think of a pipeline where one stream of data needs to be processed by several different algorithms, but serially--read; decode; transform; encode; write.
Data parallelism:
Potentially the biggest gains. CPU-bound.
Typified by image (X-ray; MRI;etc.) processing.
One large homogeneous dataset needs a one or two, often simple algorithms applied to the entire dataset. Small subset of the dataset are assigned to each execution context perform the same algorithm(s) on.

This type lends itself to event driven/state machine/select methods.
The problems arise when the process also need to perform some cpu-intensive processing in addition to the IO-driven processing. This necessitates the programmer injecting artificial breaks into the cpu-intensive code in order to ensure timely servicing of the IO events.
Whilst not onerous for a single single program with exclusive use of the hardware, it becomes necessary to re-tune the software for each cpu; OS; end even application mix; that ths code will run on. making for costly porting and on-going maintenance.
As mentioned above, this type is currently being dealt with at the macro-level.
Using clusters of commodity hardware and disk-based 'pipelines' for shared storage. Whilst reasonably effective for the last, current, and possibly next generations of 1, 2, and 4-core commodity boxes with 2 to 4 GB of ram, the cost of all the disk-IO between stages will start to make itself felt as we move to larger numbers of cores and ram per box.
Once you have cores ranging from 8 to 32; multiplied by simultaneous threading per core of 2 to 8; and anything from 64 to 512GB per box, it makes less and less sense to use processes and disk-based storage to handle the pipelines.
When the machine can store an entire partition of the dataset entirely in memory, it is far quicker to load the dataset into memory once, and have each stage of the pipeline operate on it in place. You could do this by running each stage over the dataset serially, but as with the current scheme, handing over smallish chunks from stage to stage allows you to overlap the stages and vastly reduce the end-to-end processing time. So, running all the stages as threads, bucket-chaining over a single, large shared dataset is a natural evolution of the processes and disks model that brings huge efficiency saving for negligible extra infrastructure costs.
Just a single shared integer between each stage that is incremented by the previous stage and read-only to the following stage. Indicating how far the previous stage has made it through the dataset, and where the following stage continues it next cycle. Utilises only simple, two-party condition signalling with no possibility of dead-locks, priority inversions or any of those other scare-monger nasties.
Huge performance gains secured from avoiding inter-stage pipeline IO costs.
This type is already happening in games and Computer Generated Imagery.
It simply doesn't make sense to partition these kinds of datasets (images etc.) across processes, let alone boxes. The shared-memory model and threading, is the only way to go. But again, the "big problems" of the shared memory model--deadlocking; priority inversion etc.--do not arise, because the each thread operates exclusively on its own partition of the overall dataset.
By linearly or recursively partitioning the dataset, no locking is required and each thread is free to run full-speed on its part of the problem to completion. The only synchronisation involved is the master thread waiting for the workers to finish before it does whatever needs doing with the results.
More and more, this kind of processing will be offloaded to specialist CPUs (dsps; gpus)(hereafter SPUs), for processing. However, with current setups this involves the transfer of data from cpu memory to SPUs memory and back again. And with potentially multiple processes needing the SPU's services, and with memory access already the bottleneck on performance, it will make more sense in the near future for Spus to do their thing by directly accessing the data in the main memory. Effectively making the SPUs cores extensions of the main cpu. We're already seeing this with SIMD and similar instruction sets.

The future is threaded. We are just waiting for the software to catch up with the hardware. And I fear we are going to have to wait for the next generation of programmers before that problem will be properly fixed.

Just as many of my generation have problems with using SMS--and with the language forms it generates--whilst it is simpler than tying shoelaces for the new generations. So it is with threading. Many in my generation only see the problems involved--not the potential.

Just as it tool the ubiquity of the web to drive the transition from th request-response model to the upcoming Web 2 era. So, it will require the ubiquity of threads and cores to drive the transition from forks&pipes to threads&shared state.It may take until the retirement of the Luddite generation for it to happen. But it will happen.

As you've rightly indicated, one of the primary drivers of the transition will be the evolution of computer languages that give easy--and well abstracted--access to the potential. Whilst many of those you cite are a adept at one or more of the types of concurrency, none of them are truly adept at all of them. And problems arise because real-world problems tend to require two or more of those types in the same application.

A secondary problem with most existing language abstractions of concurrency is that they tend to take one of two tacts:

A low-level approach--typified by C/C++ and libraries like Intel Threading Building Blocks--whereby the programmer is given access to, and required to exercise, full control over all aspects of the threading.
This not just enables, but forces the programmer to deal with all the issues of sharing state, not just for that data that needs to be shared, but all state! And that places the onus upon the programmer to ensure the segregation of per stage (per thread) state. A time consuming and rigorous task much better suited to automation by the compiler.
A high-level, encapsulated approach--typified by most of the concurrent FP languages like Haskell and Erlang--which removes most of the control from the programmers hands.
The problem here is that the FP (and even const correct procedural) languages prevent the programmer (at least without resorting to extraordinary procedures (with often "dangerous" sounding names)), from sharing state for those data that need to be shared, with the result that the data has to be needlessly and expensively copied, often many times, as it transitions through multi-stage pipelined processes; or as multiple workers concurrently do the same processing on their own small chunks of the total dataset, and then they are 're-assembled' back to a whole.
This compiler enforced (and needless) replication of state becomes a huge drain upon system resources via memory thrash, IO and communications protocols overheads. This can turn O(n) algorithms in to O(n^2) algorithms (or worse), once those overheads are factored into the equations. That detracts from, and can even completely negate, the benefits of concurrency. Add in the extra costs of concurrent development, maintenance and testing, and it is easy to see why people see little to be gained from the concurrency.

The solution is relatively simple, in concept at least. There needs to be a clear delineation between that state that needs to be shared--for efficient concurrency. And that state that mustn't be shared--for algorithmic safety. And the programmer must have explicit control over the former, whilst the compiler ensures and enforces the latter. In this way, thread procedures can be written without regard to the safety of local variables and state. Anything declared 'local', or 'non-shared', or better, any not explicitly marked as shared, is protected through compiler-enforced mechanisms from leaking between threads. At the same time, shared state can be passed around between threads as the needs of the algorithm dictate, without incurring huge penalties of copying involved in write-once variables.

Another way of viewing the two concurrency abstractions is:

the former seeks to give the programmer a zillion tools for controlling access to shared state and synchronising control flows between threads.
The problem here, beyond the well-known difficulties of getting this stuff right, is that the exposure and predominance of these mechanisms actively encourages programmers--especially those new to concurrency--to design their algorithms around utilising locks and synchronisation.
whilst the latter seeks to hide all such control within the abstraction beyond the programmers reach.
With this approach, you end up with either weird and unintuitive programming constructs and paradigms; or huge, unwieldy and slow compile-time analysis engines; or belt&braces, copy-everything and lock-everything always--"whether it needs it or not"--infra-structure and library code.

There is a third answer to the problems of locking and synchronisation. Avoid them! Sounds too simple to be a point worth making, but it is a fact that most algorithms and applications that can benefit from concurrency can, with a little care, be architected such that they use little or no locking or synchronisation, despite that they may manipulate large volumes of shared state, in place. The trick is to program the application so that only one thread attempts to access any given piece of data an any given time. I'm not talking about using mutexs to prevent concurrent access. More, mechanisms that don't allow the programmer to try. But without throwing the (procedural) baby out with the bath water.

This can be done today--in any language. It just requires the programmer to be aware of the techniques and apply them. But it would be a whole lot easier if programmers didn't have to become aware of the implementation details or re-invent them every time. To that end, there are several new abstractions that can help:

Memory delegates- handles to (subsections of) real shared memory, that can be passed between threads, but which can only be used by the thread holding the delegate. Old copies of the delegate become disabled. Runtime enforcement is fatal. Compile-time detection easy.
Thread-controlled shared variables:
This variable is shared--by threads A & B (& F ...) ONLY!
Access type controls on shared variables:
This variable is shared between threads A & B, but only thread A can write to it and only thread B can read from it. And B cannot read from it until A has written it. And thread A cannot write to it again until thread B had read it.
Bi-directional latching shared variables.
Variable shared between threads A & B, readable and writable by both; but cannot read until A has written; and once B has read, neither can read (and A cannot write again) until B has written; now only A can read and neither can read again (and B cannot write again), until A has written.
And so you encapsulate the request-response communications protocols in a single variable.
Self limiting queues:
Like thread controlled variables, which threads have access (and what type of access) is controlled, and both compile-time and run-time enforced. But in addition, the queues limit their own sizes such that any attempt to write when the queue is 'full' causes the writer to block. Any attempt to read when the queue is empty causes the reader to block.
You get self-regulating synchronisation with tunable buffering.
Declarative threading of functions for 'fire and forget' usage.

Delayed (lazy) results from threads (promises).

Instead of having explicitly create a thread passing the thread procedure address; retrieve the handle; and then later, explicitly call a blocking join to get the results. Instead, the above two combine to give something like:

sub threadProc1 :threaded { ... }
sub threadProc2 :threaded { ... }


my $result1 = threadProc1( @args );
my $result2 = threadProc2( @otherArgs );

unless( defined $result1 && defined $result2 ) {
    ## continue doing other stuff while we wait
}

## If the threads have completed by the time we reach this statement
## then the statements completes immediately. Otherwise it blocks unti
+l
## both results are available. Effectively joining both threads.
my $composite = $result1 + $result2;
[download]

/li>

With a few simple control and data sharing constructs such as these, all the scary, difficult parts of threading disappear under the covers, and the programmer can get back to concentrating uponn their application whilst knowing that it will benefit from whatever core and threads are available.

You can imagine the pipeline example (read; decode; transform; encode; write), from above being written something along the lines of;


my $pipeline :promise = TQmap {
    open $out, '>', $outFile;
    write( $out, $_ ) while $Qin->dequeue;
    close $out;
} 1, 10, TQmap {
    while( $Qin->dequeue ) {
        my $encoded = encode( $_ ); 
        $Qout->enqueue( $encoded );
    }
} 2, 20, TQmap {
    while( $Qin->dequeue ) {
        my $transformed = transform( $_ ); 
        $Qout->enqueue( $transformed );
    }
}, 4, 40, TQmap {
    while( $Qin->dequeue ) {
        my $decoded = decode( $_ );
        $Qout->enqueue( $decoded );
    }
}, 2, 20, TQmap { 
    ## read
    open my $in, '<'. $inFile;
    $Qout->enqueue( $_ ) while <$in>;
    close $in;
}, 1, 10, undef;

monitorKeyboard() until defined $pipeline;

if( $pipeline != SUCCESS ) {
    log( error, $pipeline );
}
else {
    log( success, $pipeline );
}

exit;
[download]

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

In reply to Re: (OT) Programming languages for multicore computers by BrowserUk
in thread (OT) Programming languages for multicore computers by citromatik

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


Syntactic Confectionery Delight
	PerlMonks