The problem with OpenCL isn't performance per se, but performance portability (well, it's only a problem for those that need such a thing, of course - many people don't). When you write OpenCL code and tweak it for one CPU or GPU, it might run at 1/10th the speed on another. This is of course something you don't have with an API that works only on GPUs from one vendor, although even there different generations of hardware might prefer different parameters or tradeoffs.
Now you can write OpenCL kernels that automatically tweak themselves to run as fast as possible on different hardware, but that requires significant extra work over just getting it to work at all.
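The autotuning idea can be sketched as a brute-force search over launch parameters, timing each candidate and keeping the fastest. This is a minimal illustration only: the kernel here is a stand-in function with a made-up cost model (work-group size 64, vector width 4 pretended optimal), not a real OpenCL launch, and all names are hypothetical.

```python
import itertools
import time

def autotune(run_kernel, param_space):
    """Pick the fastest parameter combination by brute-force timing.

    run_kernel is assumed to execute the kernel once with the given
    parameters and return only once it has finished.
    """
    best_params, best_time = None, float("inf")
    for params in itertools.product(*param_space.values()):
        candidate = dict(zip(param_space.keys(), params))
        start = time.perf_counter()
        run_kernel(**candidate)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_params, best_time = candidate, elapsed
    return best_params

# Stand-in for a real kernel launch: pretend that work-group size 64
# with vector width 4 is optimal on this imaginary device.
def fake_kernel(work_group_size, vector_width):
    penalty = abs(work_group_size - 64) + abs(vector_width - 4) * 10
    time.sleep(penalty / 100000.0)

space = {"work_group_size": [16, 32, 64, 128, 256],
         "vector_width": [1, 2, 4, 8]}
print(autotune(fake_kernel, space))
```

Real autotuners (and smarter search strategies than exhaustive enumeration) exist for OpenCL, but even then someone has to parameterize the kernel so there is something to search over - that's the extra work.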
And finally, CUDA has a bunch of hand-tweaked libraries for doing common numerical operations (matrix multiply, FFT, ...) that are (partly) written in NVIDIA GPU assembly (PTX), so those operations will be faster on CUDA than on OpenCL.
CUDA is also (a bit) easier to write/use than OpenCL code and the tooling is better, so that's another reason people often default to CUDA.
The LIFT project (http://www.lift-project.org/) is specifically trying to solve the problem of performance portability. Our approach relies on a high-level model of computation (think of something like a functional, pattern-based programming language) coupled with a rewrite-based compiler that explores the space of OpenCL programs with which to implement a computation.
We get really quite good results over a number of benchmarks - check out our papers!
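To give a flavor of what "rewrite-based" means here: the compiler represents a computation as a tree of patterns and applies semantics-preserving rewrite rules to explore equivalent implementations. This toy sketch shows one classic rule (map fusion) over a made-up tuple representation; LIFT's real IR and rule set are described in the papers and differ from this.

```python
# Toy rewrite-based pattern compiler: expressions are nested tuples,
# and a rule rewrites one pattern tree into an equivalent one.
def fuse_maps(expr):
    """Rewrite compose(map(f), map(g)) into map(compose(f, g)) recursively."""
    if isinstance(expr, tuple):
        expr = tuple(fuse_maps(e) for e in expr)
        if (expr[0] == "compose"
                and isinstance(expr[1], tuple) and expr[1][0] == "map"
                and isinstance(expr[2], tuple) and expr[2][0] == "map"):
            # Fusing the maps avoids materializing the intermediate array -
            # on a GPU, that can mean one kernel launch instead of two.
            return ("map", ("compose", expr[1][1], expr[2][1]))
    return expr

prog = ("compose", ("map", "f"), ("map", "g"))
print(fuse_maps(prog))  # → ('map', ('compose', 'f', 'g'))
```

A real system applies many such rules (including hardware-specific ones, e.g. mapping a pattern onto work-groups or vector units) and searches the resulting space for a fast variant.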
> This is of course something you don't have with an API that works only on GPUs from one vendor, although even there different generations of hardware might prefer different parameters or tradeoffs.
The paper seems to confirm your last caveat. Each point in the following summary sounds like it requires hardware-dependent fine-tuning down to the specific model, except maybe the second-to-last point about which approach works best in general:
Our key findings are the following:
• Effective parallel sorting algorithms must use the faster access on-chip memory as much and as often as possible as a substitute to global memory operations.
• Algorithmic improvements that used on-chip memory and made threads work more evenly seemed to be more effective than those that simply encoded sorts as primitive GPU operations.
• Communication and synchronization should be done at points specified by the hardware.
• Which GPU primitives (scan and 1-bit scatter in particular) are used makes a big difference. Some primitive implementations were simply more efficient than others, and some exhibit a greater degree of fine grained parallelism than others.
• A combination of radix sort, a bucketization scheme, and a sorting network per scalar processor seems to be the combination that achieves the best results.
• Finally, more so than any of the other points above, using on-chip memory and registers as effectively as possible is key to an effective GPU sort.
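The scan and 1-bit scatter primitives mentioned above are the core of GPU radix sort, and they can be sketched in plain code. This is a serial illustration of the idea only - on a GPU the prefix sum is a parallel scan and the scatter happens across thousands of threads, with the on-chip-memory concerns the paper describes.

```python
def exclusive_scan(xs):
    """Exclusive prefix sum; on a GPU this is a parallel scan primitive."""
    total, out = 0, []
    for x in xs:
        out.append(total)
        total += x
    return out

def split(keys, bit):
    """1-bit scatter: stable-partition keys by the given bit using scans."""
    flags = [(k >> bit) & 1 for k in keys]
    zeros = [1 - f for f in flags]
    zero_pos = exclusive_scan(zeros)   # destinations for keys with bit == 0
    one_pos = exclusive_scan(flags)    # offsets for keys with bit == 1
    num_zeros = sum(zeros)
    out = [0] * len(keys)
    for i, k in enumerate(keys):
        dest = one_pos[i] + num_zeros if flags[i] else zero_pos[i]
        out[dest] = k
    return out

def radix_sort(keys, bits=8):
    """LSD radix sort: one stable 1-bit split per bit, low to high."""
    for b in range(bits):
        keys = split(keys, b)
    return keys

print(radix_sort([170, 45, 75, 90, 2, 24, 66]))
```

Because `split` is stable, sorting bit by bit from least to most significant yields a fully sorted result; how efficiently the scan and scatter map onto a given GPU's memory hierarchy is exactly the hardware-dependent part.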
In my brief experiments with using GPU graphics for compute (e.g. render to texture) vs. specialized GPU compute, my experience was that the latter required tweaking but was still slower.
My sense is that parallelization still isn't solved in general, so (a bit like NP "reduction") if you can't cast your problem in terms of an "embarrassingly parallelizable" (ep) case like rendering, it's not going to be very fast. Plus, the rendering pipeline has had all hell optimized out of it.
Put another way: what features could a GPU general language have that are ep, but with no equivalent available in GPU graphics languages?
I think there are some trivial ones, e.g. older OpenGL ES versions (mobile) don't have render-to-float-texture - a crucial and ep feature for general compute.
One big reason for developers favouring CUDA is that since the early days it supported C++, Fortran and any other language with a PTX backend, whereas Khronos wanted everyone to just shut up and use C99.
Finally they understood that the world had moved on and that better support for other languages had to be provided, so let's see how much OpenCL 2.2 and SPIR can improve the situation.
In academia it's also because NVidia does a lot of stuff to make your life easy.
For example, NVidia came to our University in the UK and provided training for £20 an academic/PhD student for a 2 day course on how to use CUDA and with performance tips, hands on porting of code, etc. They also give away CUDA cards to academics under a hardware grant scheme, so it's possible to get a free Titan Xp this year for a research group.
There's not really an equivalent for AMD or Intel; a Xeon Phi Knights Landing chip is significantly more expensive than a consumer level GPU, and the same cost as a workstation GPU, and it's a lot harder to get good performance from it. It also doesn't seem like AMD are targeting this market, at least not currently.
The main problem is that NVidia is screwing things up. NVidia only supports OpenCL 1.1, which means that if you want to use C++ / SPIR, you are pretty much locked to AMD / Intel (Intel CPUs have an OpenCL -> AVX layer, so in the worst case you can always turn OpenCL code into native CPU code).
NVidia of course owns CUDA, which means they want those "premium features" locked to CUDA-only.
--------
AMD's laptop offerings have some intriguing features for OpenCL as well. Since their APUs have a CPU AND a GPU on the same die, data transfer between CPU and GPU on an AMD APU (e.g. an A10 laptop chip) is absurdly fast. Like, they share L2 cache IIRC, so the data doesn't even hit main memory, or even leave the chip.
But there's basically no point optimizing for that architecture, as far as I can tell anyway.