
Re: (OT) Programming languages for multicore computers

by BrowserUk (Patriarch)
on May 06, 2009 at 10:07 UTC


in reply to (OT) Programming languages for multicore computers

The following was started two days ago, but I've been unable to finish it. It was suggested that I post it as-is anyway, and if there is sufficient interest, I'll try to complete it later.


My first response to almost every question you've asked is: It depends.

It mostly depends upon the requirements of the algorithm and/or data that needs processing. There are essentially three basic types of parallelism (with many variations):

  1. IO parallelism:

    The simplest of them all. Mostly IO-bound.

    Typified by webservers and the like. Each execution context is essentially independent of all others.

    Each execution context spends most of its time waiting (on a device or communications partner) for something to do. And when it does get something to do, it does it relatively quickly and then goes back to waiting.

  2. Task parallelism:

    Potentially the most complex. A mixture of IO and cpu.

    Typified by the Google Map-Reduce architecture. (Though that is (currently) applied at the macro (box) level).

    Each execution context runs a different algorithm on independent data, but they need to coordinate and synchronise between each other. Here the algorithms (tasks) are overlapped by passing the stream of data from one stage to the next in smallish chunks. Think of a pipeline where one stream of data needs to be processed by several different algorithms, but serially--read; decode; transform; encode; write.

  3. Data parallelism:

    Potentially the biggest gains. CPU-bound.

    Typified by image (X-ray; MRI;etc.) processing.

    One large, homogeneous dataset needs one or two (often simple) algorithms applied to the entire dataset. A small subset of the dataset is assigned to each execution context, which performs the same algorithm(s) on it.

  1. This type lends itself to event driven/state machine/select methods.

    The problems arise when the process also needs to perform some cpu-intensive processing in addition to the IO-driven processing. This necessitates the programmer injecting artificial breaks into the cpu-intensive code in order to ensure timely servicing of the IO events.

    Whilst not onerous for a single program with exclusive use of the hardware, it becomes necessary to re-tune the software for each cpu, OS, and even application mix that the code will run on, making for costly porting and on-going maintenance.

  2. As mentioned above, this type is currently being dealt with at the macro-level.

    Using clusters of commodity hardware and disk-based 'pipelines' for shared storage. Whilst reasonably effective for the last, current, and possibly next generations of 1, 2, and 4-core commodity boxes with 2 to 4 GB of ram, the cost of all the disk-IO between stages will start to make itself felt as we move to larger numbers of cores and ram per box.

    Once you have cores ranging from 8 to 32; multiplied by simultaneous threading per core of 2 to 8; and anything from 64 to 512GB per box, it makes less and less sense to use processes and disk-based storage to handle the pipelines.

    When the machine can store an entire partition of the dataset entirely in memory, it is far quicker to load the dataset into memory once, and have each stage of the pipeline operate on it in place. You could do this by running each stage over the dataset serially, but as with the current scheme, handing over smallish chunks from stage to stage allows you to overlap the stages and vastly reduce the end-to-end processing time. So, running all the stages as threads, bucket-chaining over a single, large shared dataset is a natural evolution of the processes and disks model that brings huge efficiency saving for negligible extra infrastructure costs.

    All it takes is a single shared integer between each pair of stages, incremented by the previous stage and read-only to the following stage, indicating how far the previous stage has made it through the dataset and where the following stage continues on its next cycle (a minimal sketch follows this list). This utilises only simple, two-party condition signalling, with no possibility of dead-locks, priority inversions or any of those other scare-monger nasties.

    Huge performance gains secured from avoiding inter-stage pipeline IO costs.

  3. This type is already happening in games and Computer Generated Imagery.

    It simply doesn't make sense to partition these kinds of datasets (images etc.) across processes, let alone boxes. The shared-memory model and threading is the only way to go. But again, the "big problems" of the shared-memory model--deadlocking; priority inversion etc.--do not arise, because each thread operates exclusively on its own partition of the overall dataset.

    By linearly or recursively partitioning the dataset, no locking is required and each thread is free to run full-speed on its part of the problem to completion. The only synchronisation involved is the master thread waiting for the workers to finish before it does whatever needs doing with the results.

    More and more, this kind of processing will be offloaded to specialist processors (DSPs; GPUs)(hereafter SPUs). However, with current setups this involves the transfer of data from cpu memory to SPU memory and back again. And with potentially multiple processes needing the SPU's services, and with memory access already the bottleneck on performance, it will make more sense in the near future for SPUs to do their thing by directly accessing the data in main memory, effectively making the SPU cores extensions of the main cpu. We're already seeing this with SIMD and similar instruction sets.
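
To make the bucket-chaining idea above concrete, here is a minimal, hedged sketch in today's Perl (threads + threads::shared): two stages share one in-memory array and a single progress counter, and the second stage never reads past the first stage's counter. The stage bodies, chunk size and data are placeholders of my own, not anything from a real pipeline.

    use strict;
    use warnings;
    use threads;
    use threads::shared;

    my @data  :shared = ( 1 .. 1_000 );   ## the in-memory dataset
    my $doneA :shared = 0;                ## how far stage A has got
    my $CHUNK = 100;                      ## hand-over granularity

    ## Stage A: 'decode' in place, then publish its progress.
    my $stageA = threads->create( sub {
        for my $i ( 0 .. $#data ) {
            $data[ $i ] *= 2;                              ## stand-in for decode()
            if( ( $i + 1 ) % $CHUNK == 0 or $i == $#data ) {
                lock $doneA;
                $doneA = $i + 1;
                cond_signal $doneA;
            }
        }
    } );

    ## Stage B: 'transform', but never run ahead of stage A's counter.
    my $stageB = threads->create( sub {
        my $next = 0;
        while( $next < @data ) {
            my $limit;
            {
                lock $doneA;
                cond_wait $doneA until $doneA > $next;
                $limit = $doneA;
            }
            $data[ $_ ] += 1 for $next .. $limit - 1;      ## stand-in for transform()
            $next = $limit;
        }
    } );

    $_->join for $stageA, $stageB;
    print "last element: $data[-1]\n";                     ## 2001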

The future is threaded. We are just waiting for the software to catch up with the hardware. And I fear we are going to have to wait for the next generation of programmers before that problem will be properly fixed.

Just as many of my generation have problems with using SMS--and with the language forms it generates--whilst it is simpler than tying shoelaces for the new generations, so it is with threading. Many in my generation only see the problems involved--not the potential.

Just as it took the ubiquity of the web to drive the transition from the request-response model to the upcoming Web 2 era, so it will require the ubiquity of threads and cores to drive the transition from forks&pipes to threads&shared state. It may take until the retirement of the Luddite generation for it to happen. But it will happen.

As you've rightly indicated, one of the primary drivers of the transition will be the evolution of computer languages that give easy--and well abstracted--access to the potential. Whilst many of those you cite are adept at one or more of the types of concurrency, none of them are truly adept at all of them. And problems arise because real-world problems tend to require two or more of those types in the same application.

A secondary problem with most existing language abstractions of concurrency is that they tend to take one of two tacks:

  • A low-level approach--typified by C/C++ and libraries like Intel Threading Building Blocks--whereby the programmer is given access to, and required to exercise, full control over all aspects of the threading.

    This not just enables, but forces, the programmer to deal with all the issues of sharing state--not just for the data that needs to be shared, but all state! And that places the onus upon the programmer to ensure the segregation of per-stage (per-thread) state: a time-consuming and rigorous task much better suited to automation by the compiler.

  • A high-level, encapsulated approach--typified by most of the concurrent FP languages like Haskell and Erlang--which removes most of the control from the programmer's hands.

    The problem here is that the FP (and even const-correct procedural) languages prevent the programmer (at least without resorting to extraordinary procedures, often with "dangerous"-sounding names) from sharing state for those data that need to be shared. The result is that data has to be needlessly and expensively copied, often many times, as it transitions through multi-stage pipelined processes; or as multiple workers concurrently do the same processing on their own small chunks of the total dataset before being 're-assembled' back into a whole.

    This compiler-enforced (and needless) replication of state becomes a huge drain upon system resources via memory thrash, IO and communications protocol overheads. It can turn O(n) algorithms into O(n^2) algorithms (or worse), once those overheads are factored into the equations. That detracts from, and can even completely negate, the benefits of concurrency. Add in the extra costs of concurrent development, maintenance and testing, and it is easy to see why people see little to be gained from concurrency.

The solution is relatively simple, in concept at least. There needs to be a clear delineation between the state that needs to be shared--for efficient concurrency--and the state that mustn't be shared--for algorithmic safety. And the programmer must have explicit control over the former, whilst the compiler ensures and enforces the latter. In this way, thread procedures can be written without regard to the safety of local variables and state. Anything declared 'local' or 'non-shared' (or better, anything not explicitly marked as shared) is protected through compiler-enforced mechanisms from leaking between threads. At the same time, shared state can be passed around between threads as the needs of the algorithm dictate, without incurring the huge copying penalties involved with write-once variables.
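
As it happens, Perl's current ithreads already approximate that delineation: only variables explicitly marked :shared are visible across threads, and everything else is cloned per thread and so cannot leak. A minimal sketch (the variable names and counts are mine, purely illustrative):

    use strict;
    use warnings;
    use threads;
    use threads::shared;

    my $shared_total :shared = 0;   ## explicitly shared: visible to all threads
    my $private      = 0;           ## not marked shared: each thread gets its own copy

    my @workers = map {
        threads->create( sub {
            $private++;                                      ## touches only this thread's copy
            { lock $shared_total; $shared_total += 10; }     ## explicit, locked shared update
        } );
    } 1 .. 4;

    $_->join for @workers;

    print "shared_total = $shared_total\n";   ## 40
    print "private      = $private\n";        ## still 0 in the parent thread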

Another way of viewing the two concurrency abstractions is:

  • the former seeks to give the programmer a zillion tools for controlling access to shared state and synchronising control flows between threads.

    The problem here, beyond the well-known difficulties of getting this stuff right, is that the exposure and predominance of these mechanisms actively encourages programmers--especially those new to concurrency--to design their algorithms around utilising locks and synchronisation.

  • whilst the latter seeks to hide all such control within the abstraction, beyond the programmer's reach.

    With this approach, you end up with either weird and unintuitive programming constructs and paradigms; or huge, unwieldy and slow compile-time analysis engines; or belt&braces, copy-everything-and-lock-everything-always--"whether it needs it or not"--infrastructure and library code.

There is a third answer to the problems of locking and synchronisation. Avoid them! Sounds too simple to be a point worth making, but it is a fact that most algorithms and applications that can benefit from concurrency can, with a little care, be architected such that they use little or no locking or synchronisation, despite the fact that they may manipulate large volumes of shared state, in place. The trick is to program the application so that only one thread attempts to access any given piece of data at any given time. I'm not talking about using mutexes to prevent concurrent access. Rather, mechanisms that don't allow the programmer to try. But without throwing the (procedural) baby out with the bath water.

This can be done today--in any language. It just requires the programmer to be aware of the techniques and apply them. But it would be a whole lot easier if programmers didn't have to become aware of the implementation details or re-invent them every time. To that end, there are several new abstractions that can help:

  1. Memory delegates: handles to (subsections of) real shared memory that can be passed between threads, but which can only be used by the thread holding the delegate. Old copies of the delegate become disabled. Runtime enforcement is fatal; compile-time detection is easy.
  2. Thread-controlled shared variables:

    This variable is shared--by threads A & B (& F ...) ONLY!

  3. Access type controls on shared variables:

    This variable is shared between threads A & B, but only thread A can write to it and only thread B can read from it. And B cannot read from it until A has written it. And thread A cannot write to it again until thread B has read it.

  4. Bi-directional latching shared variables.

    Variable shared between threads A & B, readable and writable by both; but cannot read until A has written; and once B has read, neither can read (and A cannot write again) until B has written; now only A can read and neither can read again (and B cannot write again), until A has written.

    And so you encapsulate the request-response communications protocols in a single variable.

  5. Self limiting queues:

    As with thread-controlled variables, which threads have access (and what type of access) is controlled, and enforced at both compile time and run time. But in addition, the queues limit their own sizes such that any attempt to write when the queue is 'full' causes the writer to block, and any attempt to read when the queue is empty causes the reader to block.

    You get self-regulating synchronisation with tunable buffering.

  6. Declarative threading of functions for 'fire and forget' usage.
  7. Delayed (lazy) results from threads (promises).

    Instead of having to explicitly create a thread (passing the thread procedure address), retrieve the handle, and then later explicitly call a blocking join to get the results, the above two combine to give something like the following (a rough approximation in terms of today's threads module follows this list):

    sub threadProc1 :threaded { ... }
    sub threadProc2 :threaded { ... }

    my $result1 = threadProc1( @args );
    my $result2 = threadProc2( @otherArgs );

    unless( defined $result1 && defined $result2 ) {
        ## continue doing other stuff while we wait
    }

    ## If the threads have completed by the time we reach this statement
    ## then the statement completes immediately. Otherwise it blocks until
    ## both results are available. Effectively joining both threads.
    my $composite = $result1 + $result2;
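
Neither the :threaded attribute nor the implicit promise/join exist in Perl today. As a rough, hedged approximation of what they would have to expand to, using only the stock threads module (the threadProc bodies and arguments are placeholders):

    use strict;
    use warnings;
    use threads;

    sub threadProc1 { my @args = @_; return 42 }    ## placeholder bodies
    sub threadProc2 { my @args = @_; return 58 }

    ## 'fire and forget': start both procedures on their own threads
    my $t1 = threads->create( \&threadProc1, 'some', 'args'  );
    my $t2 = threads->create( \&threadProc2, 'other', 'args' );

    ## ... continue doing other stuff while we wait ...

    ## Using the results forces the (blocking) joins by hand--exactly the
    ## boilerplate the proposed :threaded/promise abstraction would hide.
    my $composite = $t1->join() + $t2->join();
    print "$composite\n";                           ## 100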

With a few simple control and data-sharing constructs such as these, all the scary, difficult parts of threading disappear under the covers, and the programmer can get back to concentrating upon their application, whilst knowing that it will benefit from whatever cores and threads are available.

You can imagine the pipeline example (read; decode; transform; encode; write) from above being written something along the lines of:

my $pipeline :promise = TQmap {    ## write
    open $out, '>', $outFile;
    write( $out, $_ ) while $Qin->dequeue;
    close $out;
}, 1, 10,
TQmap {                            ## encode
    while( $Qin->dequeue ) {
        my $encoded = encode( $_ );
        $Qout->enqueue( $encoded );
    }
}, 2, 20,
TQmap {                            ## transform
    while( $Qin->dequeue ) {
        my $transformed = transform( $_ );
        $Qout->enqueue( $transformed );
    }
}, 4, 40,
TQmap {                            ## decode
    while( $Qin->dequeue ) {
        my $decoded = decode( $_ );
        $Qout->enqueue( $decoded );
    }
}, 2, 20,
TQmap {                            ## read
    open my $in, '<', $inFile;
    $Qout->enqueue( $_ ) while <$in>;
    close $in;
}, 1, 10,
undef;

monitorKeyboard() until defined $pipeline;

if( $pipeline != SUCCESS ) {
    log( error, $pipeline );
}
else {
    log( success, $pipeline );
}
exit;
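
TQmap itself is hypothetical. For comparison, here is a rough, hedged sketch of what one such queue-connected stage might wrap today, using only the stock threads and Thread::Queue modules (tq_stage, the worker counts and the lc/print bodies are mine, purely illustrative; passing undef as an end-of-stream marker, as perlthrtut does, only works as written with one worker per stage):

    use strict;
    use warnings;
    use threads;
    use Thread::Queue;

    ## One pipeline stage: $nworkers threads dequeue from $Qin, apply $code,
    ## enqueue the result to $Qout (if any), and pass the end-marker along.
    sub tq_stage {
        my( $code, $nworkers, $Qin, $Qout ) = @_;
        return map {
            threads->create( sub {
                while( defined( my $item = $Qin->dequeue ) ) {
                    my $result = $code->( $item );
                    $Qout->enqueue( $result ) if $Qout;
                }
                $Qout->enqueue( undef ) if $Qout;    ## propagate end-of-stream
            } );
        } 1 .. $nworkers;
    }

    my $Q1 = Thread::Queue->new;    ## 'read'   -> 'decode'
    my $Q2 = Thread::Queue->new;    ## 'decode' -> 'write'

    my @decode = tq_stage( sub { lc $_[0]            }, 1, $Q1, $Q2   );
    my @write  = tq_stage( sub { print "$_[0]\n"; () }, 1, $Q2, undef );

    $Q1->enqueue( $_ ) for qw( FOO BAR BAZ );    ## the 'read' stage, inlined
    $Q1->enqueue( undef );                       ## end of input

    $_->join for @decode, @write;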

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^2: (OT) Programming languages for multicore computers
by tilly (Archbishop) on May 07, 2009 at 04:59 UTC
    This is a complicated problem. However, I do not believe that multi-threading is a real solution, for reasons that I explained on another forum some time ago.

    The short version of what I said there is that Moore's law is looking to add cores faster than SMP can scale, at which point we'll go to NUMA inside of a CPU. And once we go to NUMA, multi-threaded programs don't scale. So while there are some very important niches which can benefit from serious multi-threading, I don't buy the dream of unlimited numbers of cores available to a single multi-threaded application.

      Thank you. That's an interesting article to read.

      But, like so many articles that set out to prove the point they want to make, rather than looking at the evidence and seeing where it leads, you spend much of the time setting up Straw Men and then proceed to knock them down.

      1. The first mistake you make, (in the chronology of the article), is exactly the "big problem with threading" that my post above took special pains to refute--but you probably didn't read down that far--that of "synchronisation & locking".

        The whole paragraph starting "There are some basic trade-offs to understand." is just an elaborately constructed straw man. Synchronisation & locking is only a problem if you design your algorithms to need synchronisation and locking. So, don't do that!

        Without repeating all the carefully constructed explanations above: Why would anyone architect an algorithm to use 32k cores, so that they all needed to access the same piece of memory? It beggars belief to think that anyone would, and the simple truth is "they" wouldn't. There is no possible algorithm that would require 32k threads to contend for a single piece of memory. None.

        Let's just say for now(*) that we have a 32K core processor on our desktop, and we want to filter a modest 1024x1024 image. Do we apply a lock to the entire image and set 32k threads going to contend for it? Or do we create one lock for each individual pixel? The answer to both is a resounding: No! Why would we? We simply partition the 1M pixels into 32K lots of 32, and give one partition to each of the 32k threads.

        Voila! The entire 1MP image filtered in the same time it takes one thread to filter a 32-pixel image. No locks. No contention. And the only "synchronisation" required is the master thread that waits for them all to finish (see the sketch after this list). One load from disk. One store back to disk. As a certain annoying TV ad here in the UK has it: simples!

        (*) I don't really believe that any of us will have anything like 32k cores on our desktops in the next 20 years; I'm just going with your numbers. (See below.)

        You simply cannot do that anywhere near as efficiently with processes; nor message passing models; nor write-once storage models.

      2. The second mistake you make is that you mutate Moore's Law: "the number of transistors that can be placed inexpensively on an integrated circuit has increased exponentially, doubling approximately every two years", into "we're doubling the number of cores". And that simply isn't the case.

        You can do many things with those transistors. Doubling the number of cores is only one possibility--and one that no manufacturer has yet taken in any two successive Moore's cycles! Other things you can do are:

        • Increase the width of the address bus.

          8-bit; 16-bit; 32-bit; 64-bit. We probably won't need 128-bit addressing any time soon...

        • Increase the width of the data paths.

          128-bit (and wider) data-paths: this is already happening with SIMD instruction sets, GPUs and other SPUs.

        • Increase the size of on-die memory (cache).

          We started with 2k caches; these are now at 2 MB for L3 cache on some Nehalem processors.

        • Increase the number of levels of on-die caching.

          We started with one level; we now have three levels. What's the betting we'll see L4 caching at the next die-shrink 'tick'?

        • Add on-die high-speed core-to-core data-paths to replace the Front Side Bus.

          AMD did it a while back with their HyperTransport. Intel has just played catch-up with Nehalem and their QuickPath Interconnect.

        • Add/increase the number of register files. Duplicate sets of registers that can be switched with a single instruction.

          They've been doing this on embedded processors for decades.

        • Implement NUMA in hardware.

          Look at the architecture of the Cell processor. By using one PPE to oversee 8 SPEs, and giving each SPE its own local memory as well as access to the PPE's larger on-chip caches, you effectively have an 8-way NUMA architecture on a chip.

        In summary, relating modern CPU architectures to the OS techniques developed for the IO-bound supercomputer architectures of the 1980s and '90s just sets up another Straw Man that is easily knocked down. But it doesn't have any relevance to the short-term (2-5 years) or medium-term (5-10) futures. Much less the long-term 20 year time frame over which you projected your numbers game.

        Making predictions for what will be in 20 years time in this industry is always fraught; but I'll have a go below(**).

      3. Your third mistake, albeit that it is just a variation on 2 above, is when you say: "On a NUMA system there is a big distinction between remote and local memory.".

        Your assumption is that NUMA has to be implemented in software, at the OS level, and sit above discrete cores.

        As I've pointed out above, IBM et al. (STI) have already implemented (a form of) NUMA in hardware, and it sits subordinate to the OS. In this way, the independence (contention-free status) of the cores running under NUMA is achieved, whilst they each operate upon sub-divisions of memory that come from a single, threaded process at the OS level. This effectively inverts the "local" and "remote" status of memory as compared with conventional NUMA clusters.

        Whilst the local memory has to be kept in synchronisation with the remote memory, the contention is nominal, as the contents of each local memory simply cache a copy of a small section of the remote (main) memory, just as L3 caching does within current processors. And as each local cached copy is of an entirely different sub-division of the main memory, there is no contention.
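
      To make the partitioning argument concrete, here is a minimal, hedged sketch in today's Perl (threads + threads::shared): each worker filters only its own disjoint slice of the 'image', and the only synchronisation is the final join. The filter body and the sizes are placeholders (and scaled well down from 1024x1024 so the sketch runs quickly):

          use strict;
          use warnings;
          use threads;
          use threads::shared;

          my $PIXELS  = 256 * 256;               ## scaled-down 'image'
          my $THREADS = 8;                       ## stand-in for the 32k cores
          my $SLICE   = $PIXELS / $THREADS;

          my @image :shared = ( 0 ) x $PIXELS;   ## one dataset, loaded once

          sub filter { return $_[0] + 1 }        ## trivial stand-in filter

          my @workers = map {
              my $lo = $_ * $SLICE;
              my $hi = $lo + $SLICE - 1;
              threads->create( sub {
                  ## this thread touches only pixels $lo .. $hi; no locks needed
                  $image[ $_ ] = filter( $image[ $_ ] ) for $lo .. $hi;
              } );
          } 0 .. $THREADS - 1;

          ## the only 'synchronisation': the master waits for the workers to finish
          $_->join for @workers;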

      (**) My prediction for the medium-term, 10-year future (I chickened out of the long-term :) is that each application will be a single, multi-threaded process running within its own virtualised OS image. And at the hypervisor level, the task will be simply to distribute the threads and physical memory amongst the cores; bringing more cores on-line, or switching them off, as the workloads vary. This means that the threads of a particular VM/process/application will tend to cluster over a limited number of the available cores, controlled by dynamic affinity algorithms, but with the potential to distribute any single task's threads over the full complement of available cores should the need arise, and the appropriate algorithm be chosen.

      But that's about as far as I'm prepared to go. Time will tell for sure :)


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        The first question is how fast cores will increase. I had naively said it would continue at the same rate as Moore's law. When I look at it carefully that was not quite right. However, how wrong is it? Intel's first true dual core chip was released April, 2005. AMD's was released in May, 2005. Today we have shipping 4 core processors. However Intel is promising that by the end of this year there will be 6-core and 8-core Nehalem processors shipping. That processor is reportedly designed for even more cores than that. 8-core represents 2 doublings over the initial technology, and will likely have shipped 4.5 years later. Let's round that up to 5 years.

        Let's project that over the next 30 years. In 2010 we'll have 8-cores. If development continues in 2015 we'll have 32-cores. In 2020 we'll have 256-cores. In 2025 we'll have 1K cores. In 2030 we'll have 4K cores. In 2035 we'll have 16K cores. And in 2040 we'll have 64K cores. That's based on 5 year quadrupling time, which is below the currently measured rate.

        Based on what I know of operating systems, there will be no obvious problems scaling operating systems in the next 5 years - Windows and Linux both scale to 32 CPUs fairly well. Operating systems will appear to scale for the next 5 years, albeit with diminishing returns. The real challenges come in the following 5 years. Given how slow our profession is to notice the obvious, it may take 20 years for it to become widely understood that massively multi-threaded systems don't scale that well. And when that happens I'll be able to say, "I knew that ages ago!"

        Now let's discuss my "straw man". When I was talking about scalability limits, I was not talking about the scalability limits of the algorithms, but instead about the scalability limits of conventional operating systems, which need to synchronize CPUs for various reasons, such as sorting out the routing of I/O, both between processes and with the outside world.

        Moving forward, you raise an excellent point about the architecture of the Cell processor. I am not familiar with that architecture, but looking at it you do effectively get NUMA while hiding the details from the operating system.

        However straightforwardly multi-threaded programs are not going to be magically exempt from scalability issues due to the overhead of memory synchronization. When 2 threads that are remote from each other both access a piece of memory which needs to look consistent between them, there has to be locking under the hood at some level to make that happen. With traditional multi-threaded programs, there are no guarantees on any part of memory being shared, and this leads to complications. With pipelined data as in your previous suggestion, you've controlled what sections of memory need to be locked, and how that locking needs to happen.

        About the only guarantee for the future is that the world will be interesting as people figure all of this out.

        Update: I fixed my projections into the future. I also said when I thought people would notice these issues.

Re^2: (OT) Programming languages for multicore computers
by Gavin (Archbishop) on May 06, 2009 at 18:56 UTC

    Very informative and enjoyable read, even if a lot went over my head ++ BrowserUk.

    I do hope you find the time to complete it, perhaps it would then make a good Meditation.
