
I'm probably in a very small minority, but I think the 'free ride' we got from Moore's law, with transistor density translating into ever-higher clock frequencies, has locked us into a situation very much comparable to the automobile industry and the internal combustion engine.

If it weren't for that we'd have had to face the dragons of parallelization much earlier and we would have a programming tradition solidly founded on something other than single threaded execution.

Our languages would have likely had parallel primitives and would be able to deal with software development and debugging in a multi-threaded environment in a more graceful way.

Having the clock frequency of CPUs double on a fairly regular beat has allowed us to ignore these problems for a very long time, and now that we do have to face them we'll have to unlearn a lot of what we take to be unalterable.

I'm not sure about much when it comes to the future; the one thing I do know is that it is parallel, and if you're stuck waiting for the clock frequency increases of tomorrow you'll be waiting for a very long time, possibly forever.




Moore's Law still applies today; the number of transistors continues to double every couple of years. However, the transistors no longer go towards making clock speeds faster; instead, we get more parallel cores, and more impressive instruction sets.

And I agree that we've taken too little time to look at parallelism. In particular, our current approaches to concurrency (such as mutual exclusion or transactions) don't actually provide much in the way of useful parallelism; they just serialize bits of the code that access shared data. We don't actually have many good algorithms that allow concurrent access to the same data, just concurrent access to different data.

Currently writing my dissertation on exactly this topic: a generalized construction of scalable concurrent data structures, which allows concurrent access to the same data from multiple threads without requiring expensive synchronization instructions.


Clojure's data structures are rather good for "concurrent access to the same data". References change value, but values don't change. The references are protected by read/write locks but the values in them need no protection once they have been extracted and while they are being used.
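In Haskell terms the same pattern might look something like the sketch below (my own example; an IORef holding a persistent Data.Map stands in for Clojure's reference types):

  import Data.IORef (IORef, newIORef, readIORef, atomicModifyIORef)
  import qualified Data.Map as Map

  -- a mutable reference to an immutable (persistent) value
  type Shared = IORef (Map.Map String Int)

  newShared :: IO Shared
  newShared = newIORef Map.empty

  -- readers take a snapshot; the snapshot itself can never change, so it needs no locking
  lookupKey :: Shared -> String -> IO (Maybe Int)
  lookupKey ref k = do
    snapshot <- readIORef ref
    return (Map.lookup k snapshot)

  -- writers atomically swap in a new version; readers holding old snapshots are unaffected
  insertKey :: Shared -> String -> Int -> IO ()
  insertKey ref k v = atomicModifyIORef ref (\m -> (Map.insert k v m, ()))

Readers never contend with each other; only the swap of the reference itself has to be atomic, and old snapshots remain valid for as long as someone holds them.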


I think we would just have run into Amdahl's Law earlier.

http://en.wikipedia.org/wiki/Amdahl%27s_law


Basically you don't get a free pass for parallelising things.


If 90% of a task is parallelizable, then the maximum speedup you can get with infinite cores is 10x.
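Written out (a quick sketch of the formula in Haskell, with p the parallelizable fraction and n the number of cores):

  -- Amdahl's law: speedup with parallelizable fraction p on n cores
  amdahl :: Double -> Double -> Double
  amdahl p n = 1 / ((1 - p) + p / n)

  -- amdahl 0.9 16     ~ 6.4x
  -- amdahl 0.9 (1/0)  ~ 10x   (with infinite cores the 10% serial part caps you at 10x)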


You're right, there is no free pass.

Let me give you one example of how the wrong habits have become embedded in how we think about programming because of the serial nature of former (and most current) architectures:

One of the most elementary constructs that we use is the loop. Add four numbers: set some variable to 0, add in the first, the second, the third and finally the fourth. Return the value to the caller. The example is short because otherwise it would be large, repetitive text. But you can imagine easily what it would look like if you had to add 100 numbers in this way, and 'adding numbers' is just a very dumb stand-in for a more complicated reduce operation.

That 'section' could be called critical, and you'd be stuck there, unable to optimize it any further.

But on a parallel architecture that worked seamlessly with some programming language, what could be happening under the hood is (first+second) added to (third+fourth). You can expand that to any length list, and the time taken would be less than what you'd expect based on the trivial sequential example.
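In sketch form (plain Haskell here purely for notation, with no parallel machinery at all):

  -- sequential: one chain of dependent additions
  addFour a b c d = ((a + b) + c) + d

  -- pairwise: (a + b) and (c + d) are independent and could happen at the same time
  addFour' a b c d = (a + b) + (c + d)

For a list of n numbers the same reshaping turns a chain of n-1 dependent additions into a tree of depth roughly log2(n).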

Right now you'd have to code that up manually, and you'd have to start two 'threads' or some other high level parallel construct in order to get the work done.

As far as I know there is no hardware, compiler, operating system or other architectural component that you could use to parallelize at this level. The bottleneck is the overhead of setting things up, which would need to be negligible with respect to the work done. So parallelism would have to be built into the fabric of the underlying hardware, with extremely low overhead from the language/programmer's point of view, in order to be usable in situations like the one sketched above.

Those non-parallelizable segments might end up a lot shorter once you're able to invoke parallel constructs at that lower level.


What you've described has been available for a while in Haskell[1]. These tiny, parallelizable units of computation are called "sparks", and are spawned off as easily as "par x y".

[1]: http://www.haskell.org/ghc/docs/7.0.3/html/users_guide/lang-...
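For example, the parent's add-up-a-list case might look something like this (a rough sketch; parSum is my own name, using par and pseq from the parallel package):

  import Control.Parallel (par, pseq)

  -- divide and conquer: each half may be evaluated as a separate spark
  parSum :: [Int] -> Int
  parSum []  = 0
  parSum [x] = x
  parSum xs  = left `par` (right `pseq` left + right)
    where
      (l, r) = splitAt (length xs `div` 2) xs
      left   = parSum l
      right  = parSum r

Compile with -threaded and run with +RTS -N to have the runtime actually evaluate the sparks on multiple cores.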


That's really neat, what hardware supports such 'sparks' without huge overhead?


No special hardware is required — it's all in the runtime. Quoting from Parallel Haskell Digest[1]:

"""You may have heard of sparks and threads in the same sentence. What's the difference?

A Haskell thread is a thread of execution for IO code. Multiple Haskell threads can execute IO code concurrently and they can communicate using shared mutable variables and channels.

Sparks are specific to parallel Haskell. Abstractly, a spark is a pure computation which may be evaluated in parallel. Sparks are introduced with the par combinator; the expression (x par y) "sparks off" x, telling the runtime that it may evaluate the value of x in parallel to other work. Whether or not a spark is evaluated in parallel with other computations, or other Haskell IO threads, depends on what your hardware supports and on how your program is written. Sparks are put in a work queue and when a CPU core is idle, it can execute a spark by taking one from the work queue and evaluating it.

On a multi-core machine, both threads and sparks can be used to achieve parallelism. Threads give you concurrent, non-deterministic parallelism, while sparks give you pure deterministic parallelism.

Haskell threads are ideal for applications like network servers where you need to do lots of I/O and using concurrency fits the nature of the problem. Sparks are ideal for speeding up pure calculations where adding non-deterministic concurrency would just make things more complicated."""

[1]: http://www.well-typed.com/blog/52


Right, but at some level if you want the OS to have multiple cores active within the same process then you'll have to spawn multiple threads. So I'm wondering if these threads are pre-spawned and assigned from some kind of pool (relatively low overhead) or if they're started 'on demand' or some mechanism like that. Thank you very much for the pointer, I'll read up on this, it is definitely a very interesting development.


Indeed, there is a pool of OS threads, a pool of Haskell lightweight threads, and a spark pool. Check out these papers:

* Runtime Support for Multicore Haskell[1]

* Seq no more: Better Strategies for Parallel Haskell[2]

[1]: http://community.haskell.org/~simonmar/papers/multicore-ghc....

[2]: http://community.haskell.org/~simonmar/papers/strategies.pdf


That would be very similar to Objective-C's asynchronous blocks. The system manages a queue of blocks and a set of threads to schedule those blocks on.

I think (and I recognize I'm not a theorist) that the synchronous aspect of the queue management is still going to hurt for the original "add numbers in a loop" case until we get better hardware support for these constructs. Is it better than spawning a thread? Of course. But it still isn't ideal.


I'm not pooh-poohing improving the ability to get 10x speedups... that'd be awesome in many contexts.

Rather, I think a lot of even "embarrassingly parallel" problems have lots of micro-sequential code within.

With your example there are micro-sequential portions for allocating the processors for the split additions to run on, and then the sequential bit of adding the partial results back together again.


Those micro-sequential portions are right now lumped into one huge 'can't fix'; being able to exploit parallelism at that level would make huge speedups possible.

After all, Amdahl's law is 'bad' when you're looking at a 20% segment that you can't improve, but as the parallelizable fraction gets closer to 100%, the pay-off of even small optimizations to the sequential remainder becomes larger and larger.


Similar to mietek's Haskell example, .NET 4.0 offers an easy way to parallelize for loops using the Parallel.For method in System.Threading.Tasks [1].

  Parallel.For(0, N, i =>
  {
      // Do Work.
  });

This link has a simple performance comparison: [2]

[1] http://msdn.microsoft.com/en-us/library/system.threading.tas...

[2] http://www.lovethedot.net/2010/02/parallel-programming-in-ne...


FPGAs are where parallel processing is alive and well.

I've done a lot of real-time image processing work on FPGAs. The performance gains one can create are monumental. One of my favorite examples (one that parallels yours) is the implementation of an FIR (Finite Impulse Response) filter --a common tool in image processing.

How it works: data from n pixels is multiplied by an equal number of coefficients; the results are added and then divided by n. There are more complex forms (polyphase) but the basics still apply.

Say n=32. You multiply 32 pixel values by 32 coefficients. Then you sum all 32 results, typically using a number of stages where you sum pairs of values. For 32 values you need five summation stages (32 values -> 16 -> 8 -> 4 -> 2 -> result). After that you divide and the job is done.
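In software terms the per-output arithmetic is just a dot product followed by the scaling. A rough Haskell model of the data flow (my own sketch; on the FPGA each multiply and each summation stage is a physical pipeline stage):

  -- one FIR output: n multiplies, a pairwise summation tree, then the divide by n
  fir :: [Int] -> [Int] -> Int
  fir coeffs pixels = sumTree (zipWith (*) coeffs pixels) `div` length coeffs
    where
      sumTree [x] = x
      sumTree xs  = sumTree (pairs xs)        -- 32 -> 16 -> 8 -> 4 -> 2 -> result
      pairs (a:b:rest) = (a + b) : pairs rest
      pairs rest       = rest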

Due to pipelining this means that you are processing pixel data through the FIR filter at the operating clock rate. The first result takes 8 to 10 (or more) clock cycles to come out of the pipe. After that the results come out at a rate of one per clock cycle.

If the FIR filter is running at 500 MHz you are processing five hundred million pixels per second. Multiply that by 3 to process RGB images and you get 1.5 billion pixels per second.

BTW, this is independent of word width until you start getting into words so wide that they cause routing problems. So, yes, with proper design you could process 1.5 billion 32-bit words per second. Try that on a Core i7 in software.

For reference, consumer HD requires about 75 MHz for real-time FIR processing.

A single FPGA can have dozens of these blocks, equating to a staggering data processing rate that you simply could not approach with any modern CPU, no matter how many cores.

This is how GPUs work. Of course, the difference is that the GPU hardware is fixed, whereas FPGAs can be reconfigured with code.

My point is that we know how to use and take advantage of parallelism in applications that can use it. Most of what happens on computers running typical consumer applications has few --if any-- needs beyond what exists today. When it comes to FPGAs and this form of "extreme parallel programming", if you will, we use languages like Verilog and VHDL. Verilog looks very much like C and is easy to learn. You just have to think hardware rather than software.

The greater point might also be that when a problem lends itself to parallelization a custom hardware approach through a highly reconfigurable FPGA co-processor is probably the best solution.

The next major evolution in computing might very well be a fusion of the CPU with vast FPGA resources that could be called upon for application-specific tasks. This has existed for many years in application-specific computing, starting with the Xilinx Virtex-II Pro chips integrating multiple PowerPC processors atop the FPGA array. It'd be interesting to see the approach reach general computing. In other words, Intel buys Xilinx and they integrate a solution for consumer computing platforms with APIs provided and supported by MS and Apple.


FPGAs have been the next big thing for at least 15 years, during which time the consumer GPU has made huge performance gains and is turning into a general-purpose parallel vector unit. It still seems to me that the GPU is on track to fill this role and the FPGA will stay niche; has anything happened to change this?


What's changed is the definition of GPU. 15 years ago, the most programmability a GPU had was a few registers to switch between a handful of hardcoded modes of operation. For the past several years, GPUs have been something like 80% composed of Turing-complete processors that can run arbitrary software, and they've more recently gained features like a unified address space with the host CPU, and the ability to partition the array of vector processors to run more than one piece of code at the same time. These features are 100% irrelevant to realtime 3d rendering.

GPU architectures don't exactly resemble FPGAs when viewed up close, but for the right kind of problem, an application programmer can now deal with a GPU in the same manner as an FPGA. (In fact, with OpenCL, you can use the exact same APIs.)


> FPGAs have been the next big thing for at least 15 years

It's a tough equation. They are certainly not aimed at or geared towards a commodity co-processor slot next to the CPU in a consumer PC.

What I am arguing or proposing is that if Intel acquired Xilinx and got Microsoft and Apple behind the idea of integrating an FPGA as a standard co-processor for the i-whatever chip we could see vast performance improvements for certain applications.

Per my example, I can take a $100 FPGA today and process 1.5 billion 32-bit pixels per second through an FIR pipeline. That same hardware is instantly reprogrammable to implement a myriad of other functions in hardware, not software. The possibilities are endless.

Many years ago I was involved in writing some code for DNA sequencing. We wrote the software to gain an understanding of what had to happen. Once we knew what to do, the design was converted to hardware. The performance boost was unbelievable. If I remember correctly, we were running the same data sets 1,000 times faster or more.

GPUs are great, but I've seen cases where a graphically intensive application makes use of the GPU and precludes anything else from using it. Not having done any GPU coding myself I can't really compare the two approaches (FPGA vs. GPU). The only thing I think I can say is that a GPU is denser and possibly faster because it is fixed logic rather than programmable (it doesn't have to support all the structures and layers that an FPGA has).

What I'd love to see is an FPGA grafted onto a multicore Intel chip with a pre-defined separate co-processor expansion channel via multiple MGT channels (Multi-Gigabit Transceivers). This would allow co-processor expansion beyond the resources provided on-chip. If the entire thing is encased within a published-and-supported standard we could open the doors to an amazing leap in computing for a wide range of advanced applications.

Let's see, Intel market cap is around $120 billion and Xilinx is somewhere around 8 billion. Yeah, let's do it.


You should take a look at Cilk. It compiles two versions of your code: one sequential, one parallel. The runtime switches between them dynamically. This avoids paying a lot of overhead managing parallelism for small subroutines, but still presents the programmer with a unified view of very fine-grained concurrency.


Indeed, we see this today with our compiler (Manticore). On our 48-core AMD and 30-core Intel boxes, we now routinely see flattening speedup curves due to the small sequential portions of even our supposedly "embarrassingly parallel" benchmarks.

Compilers, even icc, are still shockingly bad at making good use of the x86_64 instruction set for sequential code. Part of that, of course, is due to C. But part of it is also our (my) fault as compiler writers because even with fantastic type information and unambiguous language semantics we emit shockingly dumb code.


Can you elaborate on x86_64 code generation? (are you referring specifically to the vector instructions?)

What do you think are the most promising directions to improve in this area?


Yes, certainly the vector instructions. There are a few key issues many C compilers seem to run into today:

- Loop unroll identification is really bad. For example, ICC will unroll a single-level loop with a very obvious body and turn it into streaming stores if the increment is written "i++", but not if it is written "i += 1".

- The register allocation problem has some subtleties. Many of the SSE registers overlap the name of the multiple packed-value register with those of the individual ones (e.g. "AX = lower 16 bits of EAX"), so knowing that you want four numbers to be in the right place without additional moves means a little bit more thinking.

But there's also very little control-flow analysis or global program analysis done, except for some basic link-time code generation and profile-guided optimization. There's a lot you can do (e.g. cross-module inlining; monomorphizing to remove uniform representation; dramatic representation changes of datatypes), though admittedly some of it requires a more static language with some additional semantic guarantees.


Well, that law is a good theoretical instrument. But aside from some computations which have been deliberately designed to be impossible to parallelize, it seems to me that Amdahl's law assumes there is some constant part of the program that just can't be parallelized. I disagree with that, and I think Erlang offers the best counterargument: just because some part has to be done on one node (or thread) doesn't mean the computer can't do other work at the same time.

Sure, if your mental model of computation is "do this, then that, then do these two things in parallel, then do these other things in parallel, then gather all the results", this would suggest you have a problem under Amdahl's law. If your mental model is a set of independent nodes sending messages to each other, then when you run into a bottleneck you spawn a few more of the most commonly used nodes. Suddenly there is no part of the program which has to be run sequentially while the rest of the world waits.


No, this isn't a counter-argument.

You're not speeding up a given instance, you're "merely" running more instances or running larger problems. (In some problems, the sequential portion is roughly constant, independent of the amount of data. So, you can reduce the effect of the sequential portion by increasing the amount of data.)

More throughput and larger data are both good things but they're not the same as running a given problem faster.


Yes, but "90% of a task" is very different from "90% of the code."

The parallel parts tend to be the tight loops, etc where the program spends most of its time, which means for some tasks a number quite close to 100% of the runtime is parallelizable.


I don't mean this as a backhanded compliment, but you're in no way in the small minority. You've very eloquently explained what pretty much everyone considers to be the actual situation.


Even way back in the mid 90s I debated getting a Pentium Pro (two-CPU) setup because I couldn't see how a CPU could keep getting faster. But I think I would have needed Windows NT, which was too expensive, and I was still new to Linux.


Not sure why you think you're expressing a minority opinion; you're stating generally accepted fact. We hit the soft limit years ago; that's why Intel finally dumped the Pentium 4: they realized they couldn't make it scale.

Everything since has been about finding ways to do more with fewer cycles, and more research and education on parallelization at all levels (from on-die branch prediction up to Google-scale distributed systems). I don't know of anyone who disputes this in principle.


It's generally accepted fact that "the free ride is over". It's more difficult to argue that we'd be better off if it had happened sooner, that we'd escape more quickly to better global optima if the local optimum we're stuck in wasn't quite so deep.

The improvements we've made to processors by adding more cache, executing instructions out of order, predicting branches better, etc. are making it harder to escape from the local optimum we've reached. To use GP's analogy, there's now a gas station for every suburb and a garage for every house. We know better solutions are possible, but getting from here to there is going to be difficult and expensive, and it could be worse than the status quo for the early adopters in the short term.


You are right; this was predicted some time ago. "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software"[1], from 2005, is the seminal essay on this subject.

[1] http://www.gotw.ca/publications/concurrency-ddj.htm



