The following was started 2 days ago, but I've been unable to finish it. It was suggested to me I post it as-is anyway, and if there is sufficient interest, I'll try to complete it later.


My first response to almost every question you've asked is: it depends.

It mostly depends upon the requirements of the algorithm and/or data that needs processing. There are essentially 3 basic types of parallelism (with many variations):

  1. IO parallelism:

    The simplest of them all. Mostly IO-bound.

    Typified by webservers and the like. Each execution context is essentially independent of all others.

    Each execution context spends most of its time waiting (on a device or communications partner) for something to do. And when it does get something to do, it does it relatively quickly and then goes back to waiting. (A minimal select-loop sketch follows this list.)

  2. Task parallelism:

    Potentially the most complex. A mixture of IO and cpu.

    Typified by the Google Map-Reduce architecture. (Though that is (currently) applied at the macro (box) level).

    Each execution context runs a different algorithm on independent data, but they need to coordinate and synchronise between each other. Here the algorithms (tasks) are overlapped by passing the stream of data from one stage to the next in smallish chunks. Think of a pipeline where one stream of data needs to be processed by several different algorithms, but serially--read; decode; transform; encode; write.

  3. Data parallelism:

    Potentially the biggest gains. CPU-bound.

    Typified by image (X-ray; MRI; etc.) processing.

    One large, homogeneous dataset needs one or two, often simple, algorithms applied to the entire dataset. A small subset of the dataset is assigned to each execution context, which performs the same algorithm(s) on it.
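To make type 1 concrete before elaborating on each of these: the classic shape of an IO-parallel context is a select loop that waits, does a quick bit of work, and goes back to waiting. A minimal sketch in Perl--the port and the echo 'work' are invented for illustration:

    use strict;
    use warnings;
    use IO::Select;
    use IO::Socket::INET;

    my $listener = IO::Socket::INET->new(
        LocalPort => 12345,     # invented port
        Listen    => 5,
        ReuseAddr => 1,
    ) or die "listen: $!";

    my $select = IO::Select->new( $listener );

    while( my @ready = $select->can_read ) {    # waiting: where most time goes
        for my $fh ( @ready ) {
            if( $fh == $listener ) {
                $select->add( scalar $listener->accept );   # new connection
            }
            elsif( defined( my $line = <$fh> ) ) {
                print $fh $line;                # a quick bit of work (echo), then back to waiting
            }
            else {
                $select->remove( $fh );         # peer closed
                close $fh;
            }
        }
    }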

  1. This type lends itself to event-driven/state-machine/select methods (as sketched above).

    The problems arise when the process also needs to perform some cpu-intensive processing in addition to the IO-driven processing. This necessitates the programmer injecting artificial breaks into the cpu-intensive code in order to ensure timely servicing of the IO events.

    Whilst not onerous for a single program with exclusive use of the hardware, it becomes necessary to re-tune the software for each cpu; OS; and even application mix; that the code will run on, making for costly porting and on-going maintenance.

  2. As mentioned above, this type is currently being dealt with at the macro-level.

    Using clusters of commodity hardware and disk-based 'pipelines' for shared storage. Whilst reasonably effective for the last, current, and possibly next generations of 1, 2, and 4-core commodity boxes with 2 to 4 GB of ram, the cost of all the disk-IO between stages will start to make itself felt as we move to larger numbers of cores and ram per box.

    Once you have cores ranging from 8 to 32; multiplied by simultaneous threading per core of 2 to 8; and anything from 64 to 512GB per box, it makes less and less sense to use processes and disk-based storage to handle the pipelines.

    When the machine can store an entire partition of the dataset entirely in memory, it is far quicker to load the dataset into memory once, and have each stage of the pipeline operate on it in place. You could do this by running each stage over the dataset serially, but as with the current scheme, handing over smallish chunks from stage to stage allows you to overlap the stages and vastly reduce the end-to-end processing time. So, running all the stages as threads, bucket-chaining over a single, large shared dataset is a natural evolution of the processes-and-disks model that brings huge efficiency savings for negligible extra infrastructure costs.

    All it takes is a single shared integer between each pair of stages: incremented by the previous stage and read-only to the following stage, it indicates how far the previous stage has made it through the dataset, and where the following stage continues on its next cycle. It utilises only simple, two-party condition signalling, with no possibility of dead-locks, priority inversions or any of those other scare-monger nasties. (A minimal sketch of the scheme follows this list.)

    Huge performance gains secured from avoiding inter-stage pipeline IO costs.

  3. This type is already happening in games and Computer Generated Imagery.

    It simply doesn't make sense to partition these kinds of datasets (images etc.) across processes, let alone boxes. The shared-memory model and threading is the only way to go. But again, the "big problems" of the shared memory model--deadlocking; priority inversion etc.--do not arise, because each thread operates exclusively on its own partition of the overall dataset.

    By linearly or recursively partitioning the dataset, no locking is required and each thread is free to run full-speed on its part of the problem to completion. The only synchronisation involved is the master thread waiting for the workers to finish before it does whatever needs doing with the results. (A sketch of this, too, follows the list.)

    More and more, this kind of processing will be offloaded to specialist processors (DSPs; GPUs) (hereafter SPUs). However, with current setups this involves the transfer of data from cpu memory to SPU memory and back again. And with potentially multiple processes needing the SPU's services, and with memory access already the bottleneck on performance, it will make more sense in the near future for SPUs to do their thing by directly accessing the data in the main memory. Effectively making the SPU cores extensions of the main cpu. We're already seeing this with SIMD and similar instruction sets.
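As a concrete illustration of point 2, here is a minimal sketch of the bucket-chaining scheme using today's ithreads: a single shared integer ($done1--an invented name) between two stages, advanced in smallish chunks by the stage in front and read by the stage behind. The doubling and incrementing stand in for real decode/transform steps:

    use strict;
    use warnings;
    use threads;
    use threads::shared;

    use constant CHUNK => 100;           # 'smallish' handover size

    my @data  :shared = ( 1 .. 1_000 );  # the whole dataset, in memory
    my $done1 :shared = 0;               # incremented by stage 1, read-only to stage 2

    my $stage1 = threads->create( sub {
        for my $i ( 0 .. $#data ) {
            $data[ $i ] *= 2;                            # stand-in for 'decode'
            if( ( $i + 1 ) % CHUNK == 0 or $i == $#data ) {
                lock $done1;
                $done1 = $i + 1;                         # publish progress
                cond_signal $done1;                      # wake the following stage
            }
        }
    } );

    my $stage2 = threads->create( sub {
        my $next = 0;
        while( $next < @data ) {
            my $limit;
            {
                lock $done1;
                cond_wait $done1 until $done1 > $next;   # wait for the stage ahead
                $limit = $done1;
            }
            $data[ $next++ ] += 1 while $next < $limit;  # stand-in for 'transform'
        }
    } );

    $_->join for $stage1, $stage2;
    print "$data[0] .. $data[-1]\n";                     # 3 .. 2001

Because the follower only ever reads below $done1, and $done1 is only ever advanced by the leader under the lock, neither stage can deadlock the other.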
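And for point 3, a sketch of lock-free data parallelism over a shared, linearly partitioned dataset. The 'image', the brighten step, and the thread count are all invented for illustration; the point is that each thread owns a disjoint slice, so no locking is needed:

    use strict;
    use warnings;
    use threads;
    use threads::shared;

    my @pixels :shared = map { $_ % 256 } 1 .. 100_000;    # stand-in for an image
    my $THREADS = 4;

    my @workers = map {
        my $id = $_;
        threads->create( sub {
            # This thread owns indices [$lo .. $hi] exclusively: no locks needed.
            my $lo = int(   $id       * @pixels / $THREADS );
            my $hi = int( ( $id + 1 ) * @pixels / $THREADS ) - 1;
            for my $i ( $lo .. $hi ) {
                my $v = $pixels[ $i ] + 10;                # 'brighten', in place
                $pixels[ $i ] = $v > 255 ? 255 : $v;
            }
        } );
    } 0 .. $THREADS - 1;

    # The only synchronisation: the master waits for its workers to finish.
    $_->join for @workers;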

The future is threaded. We are just waiting for the software to catch up with the hardware. And I fear we are going to have to wait for the next generation of programmers before that problem will be properly fixed.

Just as many of my generation have problems with using SMS--and with the language forms it generates--whilst it is simpler than tying shoelaces for the new generations, so it is with threading. Many in my generation only see the problems involved--not the potential.

Just as it took the ubiquity of the web to drive the transition from the request-response model to the upcoming Web 2 era, so it will require the ubiquity of threads and cores to drive the transition from forks&pipes to threads&shared state. It may take until the retirement of the Luddite generation for it to happen. But it will happen.

As you've rightly indicated, one of the primary drivers of the transition will be the evolution of computer languages that give easy--and well abstracted--access to the potential. Whilst many of those you cite are adept at one or more of the types of concurrency, none of them are truly adept at all of them. And problems arise because real-world problems tend to require two or more of those types in the same application.

A secondary problem with most existing language abstractions of concurrency is that they tend to take one of two tacks:

The solution is relatively simple, in concept at least. There needs to be a clear delineation between the state that needs to be shared--for efficient concurrency--and the state that mustn't be shared--for algorithmic safety. And the programmer must have explicit control over the former, whilst the compiler ensures and enforces the latter. In this way, thread procedures can be written without regard to the safety of local variables and state. Anything declared 'local' or 'non-shared'--or better, anything not explicitly marked as shared--is protected through compiler-enforced mechanisms from leaking between threads. At the same time, shared state can be passed around between threads as the needs of the algorithm dictate, without incurring the huge copying penalties involved in write-once variables.
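Perl's current ithreads already gesture in this direction: nothing is shared between threads unless explicitly marked :shared, and everything else is copied per thread, so local state cannot leak. A minimal sketch (the names and arithmetic are invented):

    use strict;
    use warnings;
    use threads;
    use threads::shared;

    my $shared_total :shared = 0;   # explicitly shared: one copy, visible to all
    my $private      = 42;          # unmarked: every thread gets its own copy

    my @kids = map {
        threads->create( sub {
            $private += 1;              # touches only this thread's copy
            lock $shared_total;         # shared state is handled explicitly
            $shared_total += $private;
        } );
    } 1 .. 4;

    $_->join for @kids;
    print "$shared_total $private\n";   # '172 42': the parent's copy is untouched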

Another way of viewing the two concurrency abstractions is:

There is a third answer to the problems of locking and synchronisation: avoid them! Sounds too simple to be a point worth making, but it is a fact that most algorithms and applications that can benefit from concurrency can, with a little care, be architected such that they use little or no locking or synchronisation, despite the fact that they may manipulate large volumes of shared state, in place. The trick is to program the application so that only one thread attempts to access any given piece of data at any given time. I'm not talking about using mutexes to prevent concurrent access. Rather, mechanisms that don't allow the programmer to try. But without throwing the (procedural) baby out with the bath water.
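For a taste of the discipline using nothing but today's toolkit: hand each chunk of data from thread to thread through a queue, and never touch a chunk once it has been handed over. A minimal sketch--the payloads and the doubling 'work' are invented for illustration; Thread::Queue and shared_clone are standard fare:

    use strict;
    use warnings;
    use threads;
    use threads::shared;
    use Thread::Queue;

    my $Q = Thread::Queue->new;

    my $worker = threads->create( sub {
        while( defined( my $chunk = $Q->dequeue ) ) {
            $_ *= 2 for @$chunk;        # sole owner of $chunk at this point
            print "@$chunk\n";
        }
    } );

    for my $n ( 1 .. 3 ) {
        my $chunk = shared_clone( [ $n, $n * 10 ] );    # invented payload
        $Q->enqueue( $chunk );
        undef $chunk;               # discipline: keep no reference once handed over
    }
    $Q->enqueue( undef );           # sentinel: no more work
    $worker->join;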

This can be done today--in any language. It just requires the programmer to be aware of the techniques and apply them. But it would be a whole lot easier if programmers didn't have to become aware of the implementation details or re-invent them every time. To that end, there are several new abstractions that can help:

  1. Memory delegates: handles to (subsections of) real shared memory, that can be passed between threads, but which can only be used by the thread holding the delegate. Old copies of the delegate become disabled. Runtime enforcement is fatal; compile-time detection easy.
  2. Thread-controlled shared variables:

    This variable is shared--by threads A & B (& F ...) ONLY!

  3. Access type controls on shared variables:

    This variable is shared between threads A & B, but only thread A can write to it and only thread B can read from it. And B cannot read from it until A has written it. And thread A cannot write to it again until thread B has read it. (A hand-rolled sketch of this follows the list.)

  4. Bi-directional latching shared variables.

    Variable shared between threads A & B, readable and writable by both; but cannot read until A has written; and once B has read, neither can read (and A cannot write again) until B has written; now only A can read and neither can read again (and B cannot write again), until A has written.

    And so you encapsulate the request-response communications protocols in a single variable.

  5. Self limiting queues:

    Like thread-controlled variables, which threads have access (and what type of access) is controlled, and both compile-time and run-time enforced. But in addition, the queues limit their own sizes such that any attempt to write when the queue is 'full' causes the writer to block. Any attempt to read when the queue is empty causes the reader to block. (A sketch of this also follows the list.)

    You get self-regulating synchronisation with tunable buffering.

  6. Declarative threading of functions for 'fire and forget' usage.
  7. Delayed (lazy) results from threads (promises).

    Instead of having to explicitly create a thread, passing it the thread procedure address; retrieve the handle; and then later, explicitly call a blocking join to get the results; the above two combine to give something like:

    sub threadProc1 :threaded { ... }
    sub threadProc2 :threaded { ... }

    my $result1 = threadProc1( @args );
    my $result2 = threadProc2( @otherArgs );

    unless( defined $result1 && defined $result2 ) {
        ## continue doing other stuff while we wait
    }

    ## If the threads have completed by the time we reach this statement
    ## then the statement completes immediately. Otherwise it blocks until
    ## both results are available. Effectively joining both threads.
    my $composite = $result1 + $result2;
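To give a flavour of number 3: a hand-rolled sketch of the write-then-read alternation using today's shared variables and condition signalling. The proposed abstraction would have the compiler enforce which thread may touch which end; here that is left to discipline, and all the names are invented:

    use strict;
    use warnings;
    use threads;
    use threads::shared;

    my $slot :shared;               # the value being handed from A to B
    my $full :shared = 0;           # 0: A may write; 1: B may read

    sub a_write {                   # called only by thread A (by discipline)
        my $value = shift;
        lock $full;
        cond_wait $full while $full;    # A cannot write again until B has read
        $slot = $value;
        $full = 1;
        cond_signal $full;
    }

    sub b_read {                    # called only by thread B (by discipline)
        lock $full;
        cond_wait $full until $full;    # B cannot read until A has written
        my $value = $slot;
        $full = 0;
        cond_signal $full;
        return $value;
    }

    my $B = threads->create( sub { print b_read(), "\n" for 1 .. 5 } );
    a_write( $_ ) for 1 .. 5;       # the main thread plays the part of A
    $B->join;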
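And number 5 can likewise be built from today's primitives (the modern Thread::Queue module offers much the same behaviour via its limit attribute). LIMIT and the payloads are invented:

    use strict;
    use warnings;
    use threads;
    use threads::shared;

    use constant LIMIT => 10;       # invented buffer size
    my @queue :shared;

    sub put {                       # writer blocks when the queue is 'full'
        my $item = shift;
        lock @queue;
        cond_wait @queue while @queue >= LIMIT;
        push @queue, $item;
        cond_broadcast @queue;
    }

    sub take {                      # reader blocks when the queue is empty
        lock @queue;
        cond_wait @queue until @queue;
        my $item = shift @queue;
        cond_broadcast @queue;
        return $item;
    }

    my $producer = threads->create( sub { put( $_ ) for 1 .. 100; put( undef ) } );
    my $consumer = threads->create( sub {
        while( defined( my $item = take() ) ) { print "got $item\n" }
    } );
    $_->join for $producer, $consumer;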

With a few simple control and data sharing constructs such as these, all the scary, difficult parts of threading disappear under the covers, and the programmer can get back to concentrating upon their application whilst knowing that it will benefit from whatever cores and threads are available.

You can imagine the pipeline example (read; decode; transform; encode; write) from above being written something along the lines of:

my $pipeline :promise = TQmap {        ## write
    open $out, '>', $outFile;
    print $out $_ while $Qin->dequeue;
    close $out;
}, 1, 10,
TQmap {                                ## encode
    while( $Qin->dequeue ) {
        my $encoded = encode( $_ );
        $Qout->enqueue( $encoded );
    }
}, 2, 20,
TQmap {                                ## transform
    while( $Qin->dequeue ) {
        my $transformed = transform( $_ );
        $Qout->enqueue( $transformed );
    }
}, 4, 40,
TQmap {                                ## decode
    while( $Qin->dequeue ) {
        my $decoded = decode( $_ );
        $Qout->enqueue( $decoded );
    }
}, 2, 20,
TQmap {                                ## read
    open my $in, '<', $inFile;
    $Qout->enqueue( $_ ) while <$in>;
    close $in;
}, 1, 10,
undef;

monitorKeyboard() until defined $pipeline;

if( $pipeline != SUCCESS ) { log( error,   $pipeline ) }
else                       { log( success, $pipeline ) }

exit;

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."
