PerlMonks  

Re^3: (OT) Programming languages for multicore computers

by BrowserUk (Patriarch)
on May 07, 2009 at 14:12 UTC


in reply to Re^2: (OT) Programming languages for multicore computers
in thread (OT) Programming languages for multicore computers

Thank you. That's an interesting article to read.

But, like so many articles that set out to prove the point their authors want to make, rather than looking at the evidence and seeing where it leads, you spend much of the time setting up Straw Men and then knocking them down.

  1. The first mistake you make (in the chronology of the article) is exactly the "big problem with threading" that my post above took special pains to refute--though you probably didn't read down that far--that of "synchronisation & locking".

    The whole paragraph starting "There are some basic trade-offs to understand." is just an elaborately constructed straw man. Synchronisation and locking are only a problem if you design your algorithms to need synchronisation and locking. So don't do that!

    Without repeating all the carefully constructed explanations above: why would anyone architect an algorithm to use 32K cores such that they all needed to access the same piece of memory? It beggars belief to think that anyone would, and the simple truth is "they" wouldn't. There is no possible algorithm that would require 32K threads to contend for a single piece of memory. None.

    Let's just say for now(*) that we have a 32K-core processor on our desktop, and we want to filter a modest 1024x1024 image. Do we apply a lock to the entire image and set 32K threads contending for it? Or do we create one lock for each individual pixel? The answer to both is a resounding: No! Why would we? We simply partition the 1M pixels into 32K lots of 32, and give one partition to each of the 32K threads.

    Voilà! The entire 1MP image filtered in the same time it takes one thread to filter 32 pixels. No locks. No contention. And the only "synchronisation" required is the master thread waiting for them all to finish. One load from disk. One store back to disk. As a certain annoying TV ad here in the UK has it: simples!

    (*I don't really believe that any of us will have anything like 32K cores on our desktops in the next 20 years, but I'm just going with your numbers. See below.)

    You simply cannot do that anywhere near as efficiently with processes, nor with message-passing models, nor with write-once storage models.
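
    To make the partitioning described above concrete, here is a minimal Perl ithreads sketch. It is my own illustration rather than code from this thread: the image is shrunk to 1024 pixels, 32 threads stand in for the 32K cores, and the "filter" is a toy brightness tweak.

        use strict;
        use warnings;
        use threads;

        my $PIXELS  = 1024;              # scaled-down stand-in for the 1MP image
        my $THREADS = 32;                # stand-in for the 32K cores
        my @image   = map { int rand 256 } 1 .. $PIXELS;

        my $chunk = $PIXELS / $THREADS;  # 32 pixels per partition
        my @workers;
        for my $t ( 0 .. $THREADS - 1 ) {
            my @slice = @image[ $t * $chunk .. ( $t + 1 ) * $chunk - 1 ];
            push @workers, threads->create( sub {
                # Each thread filters only its own partition: no locks, no contention.
                return [ map { $_ > 245 ? 255 : $_ + 10 } @slice ];
            } );
        }

        # The only "synchronisation": the master waits for the workers to finish.
        my @filtered = map { @{ $_->join } } @workers;
        print scalar(@filtered), " pixels filtered\n";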

  2. The second mistake you make is that you mutate Moore's Law--"the number of transistors that can be placed inexpensively on an integrated circuit has increased exponentially, doubling approximately every two years"--into "we're doubling the number of cores". And that simply isn't the case.

    You can do many things with those transistors. Doubling the number of cores is only one possibility--and one that no manufacturer has yet taken in any two successive Moore's cycles! Other things you can do are:

    • Increase the width of the address bus.

      8-bit; 16-bit; 32-bit; 64-bit. We probably won't need 128-bit addressing any time soon...

    • Increase the width of the data paths.

      128-bit data paths are already happening with SIMD instruction sets, GPUs, and other SPUs.

    • Increase the size of on-die memory (cache).

      We started with 2k caches; these are now at 2 MB for L3 cache on some Nehalem processors.

    • Increase the number of levels of on-die caching.

      We started with one level; we now have three levels. What's the betting we'll see L4 caching at the next die-shrink 'tick'?

    • Add on-die high-speed core-to-core data paths to replace the Front Side Bus.

      AMD did it a while back with their HyperTransport. Intel has just played catch-up with Nehalem and their QuickPath Interconnect.

    • Add/increase the number of register files. Duplicate sets of registers that can be switched with a single instruction.

      They've been doing this on embedded processors for decades.

    • Implement NUMA in hardware.

      Look at the architecture of the Cell processor. By using one PPE to oversee 8 SPEs, and giving each SPE its own local memory as well as access to the PPE's larger on-chip caches, you effectively have an 8-way NUMA architecture on a chip.

    In summary, relating modern CPU architectures to the OS techniques developed for the IO-bound supercomputer architectures of the 1980s and '90s just sets up another Straw Man that is easily knocked down. But it doesn't have any relevance to the short-term (2-5 years) or medium-term (5-10) futures. Much less the long-term 20 year time frame over which you projected your numbers game.

    Making predictions for what will be in 20 years time in this industry is always fraught; but I'll have a go below(**).

  3. Your third mistake, albeit just a variation on 2 above, is when you say: "On a NUMA system there is a big distinction between remote and local memory."

    Your assumption is that NUMA has to be implemented in software, at the OS level, and sit above discrete cores.

    As I've pointed out above, IBM et al. (STI) have already implemented (a form of) NUMA in hardware, and it sits subordinate to the OS. In this way, the independence (contention-free status) of the cores running under NUMA is achieved, whilst they each operate upon sub-divisions of memory that come from a single, threaded process at the OS level. This effectively inverts the "local" and "remote" status of memory as compared with conventional NUMA clusters.

    Whilst the local memory has to be kept in synchronisation with the remote memory, the contention is nominal, as each local memory is simply caching a copy of a small section of the remote (main) memory--just as L3 caching does within current processors. And as each local cached copy is of an entirely different sub-division of the main memory, there is no contention.

(**) My prediction for the medium-term, 10-year future (I chickened out of the long term :) is that each application will be a single, multi-threaded process running within its own virtualised OS image. At the hypervisor level, the task will simply be to distribute the threads and physical memory amongst the cores, bringing more cores on-line, or switching them off, as the workloads vary. This means that the threads of a particular VM/process/application will tend to cluster over a limited number of the available cores, controlled by dynamic affinity algorithms, but with the potential to distribute any single task's threads over the full complement of available cores should the need arise and an appropriate algorithm be chosen.

But that's about as far as I'm prepared to go. Time will tell for sure :)


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^4: (OT) Programming languages for multicore computers
by tilly (Archbishop) on May 07, 2009 at 15:42 UTC
    The first question is how fast cores will increase. I had naively said it would continue at the same rate as Moore's law. When I look at it carefully that was not quite right. However how wrong is it? Intel's first true dual core chip was released April, 2005. AMD's was released in May, 2009. Today we have shipping 4 core processors. However Intel is promising that by the end of this year there will be 6-core and 8-core Nehalem processors shipping. That processor is reportedly designed for even more cores than that. 8-core represents 2 doublings over the initial technology, and will likely have shipped 4.5 years later. Let's round that up to 5 years.

    Let's project that over the next 30 years. In 2010 we'll have 8-cores. If development continues in 2015 we'll have 32-cores. In 2020 we'll have 256-cores. In 2025 we'll have 1K cores. In 2030 we'll have 4K cores. In 2035 we'll have 16K cores. And in 2040 we'll have 64K cores. That's based on 5 year quadrupling time, which is below the currently measured rate.

    Based on what I know of operating systems, there will be no obvious problems scaling operating systems in the next 5 years - Windows and Linux both scale to 32 CPUs fairly well. Operating systems will appear to scale for the next 5 years, albeit with diminishing returns. The real challenges come in the following 5 years. Given how slow our profession is to notice the obvious, it may take 20 years for it to become widely understood that massively multi-threaded systems don't scale that well. And when that happens I'll be able to say, "I knew that ages ago!"

    Now let's discuss my "straw man". When I was talking about scalability limits, I was not talking about the scalability limits of the algorithms, but instead about the scalability limits of conventional operating systems, which need to synchronize CPUs for various reasons, such as sorting out the routing of I/O, both between processes and with the outside world.

    Moving forward, you raise an excellent point about the architecture of the Cell processor. I am not familiar with that architecture, but looking at it you do effectively get NUMA while hiding the details from the operating system.

    However, straightforwardly multi-threaded programs are not going to be magically exempt from scalability issues due to the overhead of memory synchronization. When 2 threads that are remote from each other both access a piece of memory which needs to look consistent between them, there has to be locking under the hood at some level to make that happen. With traditional multi-threaded programs, there are no guarantees about which parts of memory are shared, and this leads to complications. With pipelined data as in your previous suggestion, you've controlled what sections of memory need to be locked, and how that locking needs to happen.
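
    A minimal Perl sketch (an illustration of mine, not code from either post) of how that point surfaces at the language level: when two threads update the same shared variable, a lock is needed to keep it consistent.

        use strict;
        use warnings;
        use threads;
        use threads::shared;

        # One piece of explicitly shared state, visible to both threads.
        my $counter : shared;
        $counter = 0;

        my @t = map {
            threads->create( sub {
                for ( 1 .. 10_000 ) {
                    lock $counter;   # without the lock, concurrent increments can be lost
                    $counter++;
                }
            } );
        } 1 .. 2;

        $_->join for @t;
        print "$counter\n";          # 20000 with the lock; possibly less without it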

    About the only guarantee for the future is that the world will be interesting as people figure all of this out.

    Update: I fixed my projections into the future. I also said when I thought people would notice these issues.

      Intel's first true dual core chip was released April, 2005. AMD's was released in May, 2009.

      Hm. Your dates are wrong. AMD's Athlon 64 X2 dual-core was released on May 31, 2005. And their Phenom X3 and X4 in 2007.

      But that obscures the fact that both companies were integrating FPUs onto their dies way back when (the 486, circa 1989).

      And they've been utilising the extra transistors that have become available through process shrinks (e.g. 65nm to 45nm to 32nm) for all sorts of things: increasing the number of levels and the sizes of on-chip caches over the years; and, most recently, as I mentioned above, high-speed, low-latency, point-to-point serial/parallel bus technologies--effectively an on-chip, high-speed LAN.

      Also, for the last few years Intel have followed their well-known tick-tock strategy. On the tick cycle, they do a process shrink on the current micro-architecture. On the tock cycle, they release a new micro-architecture. With Nehalem being the latest tock, the next cycle will be another process shrink, during which they'll be looking to increase clock rates and/or reduce power consumption.

      (And that's another use for the extra transistors I forgot to mention; the incorporation of stage-by-stage clock-frequency and shut-down circuitry allowing them to slow-down or switch-off large chunks of the chips independently to conserve power.)

      Take all those factors into account and I think you'll see that your number-of-cores projections, even your revised numbers, are still way optimistic. I think we might reach 32-core processors at the high end over the next 5 to 6 years. Just. But it'll take at least another cycle for them to reach the commodity box. And I'd push the 1K-core mark back to the 2030 time frame.

      Windows and Linux both scale to 32 CPUs fairly well. Operating systems will appear to scale for the next 5 years, albeit with diminishing returns. The real challenges come in the following 5 years.

      Hm. I cannot talk for Linux, but talking about "the Windows OS" in singular terms is a mistake.

      Vista is a significantly different animal to XP. And I'm not talking about Aero or similar cosmetic differences. I mean deep down in the guts of the kernel where, for example, the scheduler, which previously used fixed time-based quanta as the basis of round-robin scheduling, now uses instruction cycle counting to apportion the CPU(s) more fairly.

      Then there are beasts like Windows HPC, which use radically different kernel structures from those of the desktop, or even the general-purpose Windows Server, variants.

      And Windows 7 has already incorporated VM technology--albeit for somewhat dubious reasons--into a mainstream, desktop product. And every day there is a new announcement by one of the various players in the Virtual Server marketplace of their latest, greatest twist on the theme.

      I occasionally need to access websites that simply do not work unless ActiveX is enabled. Previously, I would totally avoid such sites, but these days I have a copy of XP installed in a Virtual PC--with neither LAN nor real hard-disk access--configured specifically for browsing this kind of website. A virus scanner runs over the relatively small virtual hard disk each time I boot the image--and if it ever detects anything it cannot remove, I just dump the virtual disk and deploy a copy. Effectively it is a single-application virtual OS. It's not hard to see how this technique will be integrated deep into Windows 8.

      (I'm aware that the better-supported Linux distributions come in several flavours--I'm just not equipped to comment on them.)

      When I was talking about scalability limits, I was not talking about the scalability limits of the algorithms, but instead about the scalability limits of conventional operating systems

      Basically what I'm saying is that it is naive to predict the future on the basis of hyper-evolution of the hardware whilst simultaneously having software development stagnate. There are already strong indications of the ways in which OSs are going to evolve. And hypervisors--with all their abilities to segregate and route disk and network IO, even that originating from multiple concurrent OSs--are beginning to move under the OS, closer to the hardware.

      In addition, if we can have SPUs for FP, graphics and audio, why not dedicate one core per chip to IO? You'll remember the little-mentioned 80186/80188 variants of the x86 architecture, with their specialist IO (ins/outs) instructions and extra dedicated IO hardware. Throw one of those on the chip as one of the 8 cores and give it sole responsibility for external IO. Have it reading and writing directly to physical memory pages that lie in a range of the 64-bit physical address space dedicated to that purpose, but map those physical pages into the virtual address spaces of the processes doing the IO. Have dedicated IO threads in each process. You've now removed DMA contention between the GPUs and main memory, and between the IOPU and the IO channels. This is not a new idea: mainframes had dedicated IO processors offloading that workload from the main CPUs decades ago, e.g. the IBM 3174 and 3274 Control Units for the 370 processors.

      You cannot achieve that address bus contention reduction using single threaded processes!
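
      The per-process half of that idea--dedicated IO threads--can already be sketched in today's Perl by funnelling all output through one thread via a queue. This is my own minimal illustration (toy workers, with print standing in for the real IO channel), not a design taken from the post.

          use strict;
          use warnings;
          use threads;
          use Thread::Queue;

          my $ioq = Thread::Queue->new;

          # The one thread allowed to touch the output channel.
          my $io = threads->create( sub {
              while ( defined( my $buf = $ioq->dequeue ) ) {
                  print $buf;              # stand-in for disk/network output
              }
          } );

          # Compute workers never perform IO themselves; they just hand buffers over.
          my @workers = map {
              my $id = $_;
              threads->create( sub { $ioq->enqueue("result from worker $id\n") } );
          } 1 .. 4;

          $_->join for @workers;
          $ioq->enqueue(undef);            # tell the IO thread we're done
          $io->join;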

      However straightforwardly multi-threaded programs are not going to be magically exempt from scalability issues...

      I never said they would be. But why do we have to confine ourselves to "straightforwardly multi-threaded programs"?

      When 2 threads that are remote from each other both access a piece of memory which needs to look consistent between them, there has to be locking under the hood at some level to make that happen.

      Firstly, that "When" should be "If". Whilst there might be large volumes of notionally "shared state" within a threaded application, with the right algorithms and partitioning very little of it actually needs to be accessed concurrently by multiple threads.

      Even at the OS level, shared state tends to be used serially. Take IO buffers as an example. First the process writes the buffer--it then transfers it to the OS for output; or the OS fills a buffer from disk or network and then hands it over to the process.

      And existing OSs are already adept at managing the consistency of duplicate memory images--think cache coherency--mostly using hardware assist. There's no reason that similar mechanisms cannot be used to protect a process' per-thread storage, one from the other.

      With traditional multi-threaded programs, there are no guarantees on any part of memory being shared, and this leads to complications.

      The second main theme of my original post was dedicated to explaining how languages need to evolve to support the programmer in the use of threading. And the primary way in which they need to evolve is to provide compiler- and runtime-enforced (possibly with hardware assist) clear delineation between thread-local, non-shared variables (in Perl's terms, "lexicals") and explicitly shared global variables.

      threads already does this, but then screws it all up by insisting on automatically cloning vast chunks of stuff whether you need it or not. Haskell and other FP languages, by their very nature, already ensure that local variables are 'thread-safe'. But in the process they either take away access to shared state completely, or over-complicate its use through dogmatic design.

      It's not that hard to see the possibility of a language designed from the ground up with threading in mind, providing the best of both worlds: compiler-enforced, non-shared, thread-local variable spaces and explicitly programmer-controlled, explicitly shared global variable spaces.

      Combine those two things with some of the new threading constructs I outlined, along with algorithms explicitly adapted to avoid locking and synchronisation, and you end up with multi-threaded programming that is no more complex than the techniques currently quite commonplace in process-based concurrency--i.e. file sharing and locking, DB transactions and the like--with all the benefits of in-process shared-state accessibility and performance.
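
      Perl's current model already hints at that split, albeit without the compiler enforcement. A small sketch of my own (not code from the post) showing the lexical/shared distinction:

          use strict;
          use warnings;
          use threads;
          use threads::shared;

          my $per_thread = 0;          # plain lexical: cloned, each thread gets its own copy
          my $shared_count : shared;   # explicitly shared: one copy visible to every thread
          $shared_count = 0;

          my @t = map {
              threads->create( sub {
                  $per_thread++;                            # changes only this thread's copy
                  { lock $shared_count; $shared_count++; }  # explicit, visible to all threads
              } );
          } 1 .. 4;
          $_->join for @t;

          print "per-thread copy in parent: $per_thread\n";    # still 0: children changed clones
          print "shared count:              $shared_count\n";  # 4: the one shared variable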

      it may take 20 years for it to become widely understood that massively multi-threaded systems don't scale that well. And when that happens I'll be able to say, "I knew that ages ago!"

      I assume that, like me, you are just an informed observer with no special insight into what the hardware fabs or software labs have on their drawing boards. We are both trying to look at history and the current state of things, and use our experiences to date to predict where things are going to go. We differ in our conclusions, but that may be influenced by our differing backgrounds. Maybe the reality will be somewhere between us.

      One thing I do hope is that this place will still be around so that we can come back here and see how close--or not--we got; in 5, 10, 15 & 20 years time. Flesh willing.


