The problem with OpenCL isn't performance per se, but performance portability (well, it's only a problem for those that need such a thing, of course - many people don't). When you write OpenCL code and tweak it for one CPU or GPU, it might run at 1/10th the speed on another. This is of course something you don't have with an API that works only on GPUs from one vendor, although even there different generations of hardware might prefer different parameters or tradeoffs.
Now you can write OpenCL kernels that automatically tweak themselves to run as fast as possible on different hardware, but that requires significant extra work over just getting it to work at all.
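The autotuning idea can be sketched as a brute-force search over launch parameters, timing each candidate and keeping the fastest. This is a minimal illustration only: the kernel here is a stand-in function with a made-up cost model (work-group size 64, vector width 4 pretended optimal), not a real OpenCL launch, and all names are hypothetical.

```python
import itertools
import time

def autotune(run_kernel, param_space):
    """Pick the fastest parameter combination by brute-force timing.

    run_kernel is assumed to execute the kernel once with the given
    parameters and return only once it has finished.
    """
    best_params, best_time = None, float("inf")
    for params in itertools.product(*param_space.values()):
        candidate = dict(zip(param_space.keys(), params))
        start = time.perf_counter()
        run_kernel(**candidate)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_params, best_time = candidate, elapsed
    return best_params

# Stand-in for a real kernel launch: pretend that work-group size 64
# with vector width 4 is optimal on this imaginary device.
def fake_kernel(work_group_size, vector_width):
    penalty = abs(work_group_size - 64) + abs(vector_width - 4) * 10
    time.sleep(penalty / 100000.0)

space = {"work_group_size": [16, 32, 64, 128, 256],
         "vector_width": [1, 2, 4, 8]}
print(autotune(fake_kernel, space))
```

Real autotuners (and smarter search strategies than exhaustive enumeration) exist for OpenCL, but even then someone has to parameterize the kernel so there is something to search over - that's the extra work.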
And finally, CUDA has a bunch of hand-tweaked libraries for doing common numerical operations (matrix multiply, FFT, ...) that are (partly) written in NVIDIA GPU assembly (PTX), so those operations will be faster on CUDA than on OpenCL.
CUDA is also (a bit) easier to write/use than OpenCL code and the tooling is better, so that's another reason people often default to CUDA.
The LIFT project (http://www.lift-project.org/) is specifically trying to solve the problem of performance portability. Our approach relies on a high-level model of computation (think of something like a functional, pattern-based programming language) coupled with a rewrite-based compiler that explores the space of OpenCL programs with which to implement a computation.
We get really quite good results over a number of benchmarks - check out our papers!
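To give a flavor of what "rewrite-based" means here: the compiler represents a computation as a tree of patterns and applies semantics-preserving rewrite rules to explore equivalent implementations. This toy sketch shows one classic rule (map fusion) over a made-up tuple representation; LIFT's real IR and rule set are described in the papers and differ from this.

```python
# Toy rewrite-based pattern compiler: expressions are nested tuples,
# and a rule rewrites one pattern tree into an equivalent one.
def fuse_maps(expr):
    """Rewrite compose(map(f), map(g)) into map(compose(f, g)) recursively."""
    if isinstance(expr, tuple):
        expr = tuple(fuse_maps(e) for e in expr)
        if (expr[0] == "compose"
                and isinstance(expr[1], tuple) and expr[1][0] == "map"
                and isinstance(expr[2], tuple) and expr[2][0] == "map"):
            # Fusing the maps avoids materializing the intermediate array -
            # on a GPU, that can mean one kernel launch instead of two.
            return ("map", ("compose", expr[1][1], expr[2][1]))
    return expr

prog = ("compose", ("map", "f"), ("map", "g"))
print(fuse_maps(prog))  # → ('map', ('compose', 'f', 'g'))
```

A real system applies many such rules (including hardware-specific ones, e.g. mapping a pattern onto work-groups or vector units) and searches the resulting space for a fast variant.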
> This is of course something you don't have with an API that works only on GPUs from one vendor, although even there different generations of hardware might prefer different parameters or tradeoffs.
The paper seems to confirm your last caveat. Each point in the following summary sounds like it requires hardware-dependent fine-tuning down to the specific model, except maybe the second-to-last point about which approach works best in general:
Our key findings are the following:
• Effective parallel sorting algorithms must use the faster access on-chip memory as much and as often as possible as a substitute to global memory operations.
• Algorithmic improvements that used on-chip memory and made threads work more evenly seemed to be more effective than those that simply encoded sorts as primitive GPU operations.
• Communication and synchronization should be done at points specified by the hardware.
• Which GPU primitives (scan and 1-bit scatter in particular) are used makes a big difference. Some primitive implementations were simply more efficient than others, and some exhibit a greater degree of fine grained parallelism than others.
• A combination of radix sort, a bucketization scheme, and a sorting network per scalar processor seems to be the combination that achieves the best results.
• Finally, more so than any of the other points above, using on-chip memory and registers as effectively as possible is key to an effective GPU sort.
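The scan and 1-bit scatter primitives mentioned above are the core of GPU radix sort, and they can be sketched in plain code. This is a serial illustration of the idea only - on a GPU the prefix sum is a parallel scan and the scatter happens across thousands of threads, with the on-chip-memory concerns the paper describes.

```python
def exclusive_scan(xs):
    """Exclusive prefix sum; on a GPU this is a parallel scan primitive."""
    total, out = 0, []
    for x in xs:
        out.append(total)
        total += x
    return out

def split(keys, bit):
    """1-bit scatter: stable-partition keys by the given bit using scans."""
    flags = [(k >> bit) & 1 for k in keys]
    zeros = [1 - f for f in flags]
    zero_pos = exclusive_scan(zeros)   # destinations for keys with bit == 0
    one_pos = exclusive_scan(flags)    # offsets for keys with bit == 1
    num_zeros = sum(zeros)
    out = [0] * len(keys)
    for i, k in enumerate(keys):
        dest = one_pos[i] + num_zeros if flags[i] else zero_pos[i]
        out[dest] = k
    return out

def radix_sort(keys, bits=8):
    """LSD radix sort: one stable 1-bit split per bit, low to high."""
    for b in range(bits):
        keys = split(keys, b)
    return keys

print(radix_sort([170, 45, 75, 90, 2, 24, 66]))
```

Because `split` is stable, sorting bit by bit from least to most significant yields a fully sorted result; how efficiently the scan and scatter map onto a given GPU's memory hierarchy is exactly the hardware-dependent part.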
In my brief experiments with using GPU graphics for compute (e.g. render to texture) vs. specialized GPU compute, my experience was that the latter required tweaking but was still slower.
My sense is that parallelization still isn't solved in general, so (a bit like NP "reduction") if you can't cast your problem in terms of an "embarrassingly parallelizable" (ep) case like rendering, it's not going to be very fast. Plus, the rendering pipeline has had all hell optimized out of it.
Put another way: what features could a GPU general language have that are ep, but with no equivalent available in GPU graphics languages?
I think there are some trivial ones, e.g. older OpenGL ES versions (mobile) don't have render-to-float-texture - a crucial and ep feature for general compute.
One big reason for developers favouring CUDA is that since the early days it supported C++, Fortran and any other language with a PTX backend, whereas Khronos wanted everyone to just shut up and use C99.
Finally they understood that the world had moved on and that better support for other languages had to be provided, so let's see how much OpenCL 2.2 and SPIR can improve the situation.
In academia it's also because NVidia does a lot of stuff to make your life easy.
For example, NVidia came to our University in the UK and provided training for £20 an academic/PhD student for a 2 day course on how to use CUDA and with performance tips, hands on porting of code, etc. They also give away CUDA cards to academics under a hardware grant scheme, so it's possible to get a free Titan Xp this year for a research group.
There's not really an equivalent for AMD or Intel; a Xeon Phi Knights Landing chip is significantly more expensive than a consumer level GPU, and the same cost as a workstation GPU, and it's a lot harder to get good performance from it. It also doesn't seem like AMD are targeting this market, at least not currently.
The main problem is that NVidia is screwing things up. NVidia only supports OpenCL 1.1, which means that if you want to use C++ / SPIR, you are pretty much locked to AMD / Intel (Intel CPUs have an OpenCL -> AVX layer, so in the worst case you can always turn OpenCL code into native CPU code).
NVidia of course owns CUDA, which means they want those "premium features" locked to CUDA-only.
--------
AMD's laptop offerings have some intriguing features for OpenCL as well. Since their APUs have a CPU AND a GPU on the same die, data transfer between CPU and GPU on an AMD APU (e.g. an A10 laptop chip) is absurdly fast. Like, they share L2 cache IIRC, so the data doesn't even hit main memory, or even leave the chip.
But there's basically no point optimizing for that architecture, as far as I can tell anyway.