Foreword: I'm biased, as I worked on CUDA for several years.

The conclusions offered by this deck are mostly FUD.

First of all, Haswell, the architecture where those transactional memory primitives are available, isn't out for another year. Saying Knights Corner (KNC) was available in November 2011 is also deceptive: Intel demo'd it at SC11, but you can't buy one yet.

Second, he helpfully glosses over the fact that Cilk++'s elemental functions are identical to how CUDA and ISPC work: write a specially decorated single-threaded function, use a specially decorated function call, and end up with parallel work. I think it's exceedingly likely that the industry will standardize on this as the data-parallel methodology of choice within the next ten years. The exact timeframe will depend on how quickly GPUs and CPUs converge in terms of functionality (while still having vastly different performance characteristics). Task-parallel stuff will be done with something else.
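
To make that "decorated scalar function plus decorated call" pattern concrete, here is a minimal sketch in plain C++. The OpenMP SIMD pragmas are my stand-in for the decoration and are an assumption, not the syntax used by the deck, CUDA, or ISPC; the names `saxpy_one` and `saxpy` are hypothetical. In CUDA the same roles are played by a `__global__` function and a `<<<grid, block>>>` launch.

```cpp
#include <cstddef>
#include <vector>

// The "specially decorated single-threaded function": it is written as if it
// handled exactly one element. The pragma asks an OpenMP-aware compiler to
// also generate a vectorized variant; a compiler without OpenMP support
// ignores the pragma and the code still computes the same result.
#pragma omp declare simd
inline float saxpy_one(float a, float x, float y) {
    return a * x + y;
}

// The "specially decorated call": the loop annotation states that iterations
// are independent, so the compiler may spread them across SIMD lanes.
// Assumption: y holds at least as many elements as x.
std::vector<float> saxpy(float a,
                         const std::vector<float>& x,
                         const std::vector<float>& y) {
    const std::size_t n = x.size();
    std::vector<float> out(n);
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = saxpy_one(a, x[i], y[i]);
    }
    return out;
}
```

Calling `saxpy(2.0f, x, y)` looks like ordinary serial code at the call site. The point is that nothing in `saxpy_one` mentions lanes, threads, or vector widths; that mapping is left to the compiler and the hardware.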

The really difficult question will be how to get performance portability. C++ (or Fortran) code that runs well on Haswell will probably run like crap on KNC and vice-versa due to differences in the number of threads you need in flight, cache sizes, vast latency differences, etc. (Look at OpenCL running on two GPUs or especially CPU vs GPU as an example today.) Solving that is going to be the real challenge.