> This is of course something you don't have with an API that works only on GPUs from one vendor, although even there different generations of hardware might prefer different parameters or tradeoffs.
The paper seems to confirm your last caveat. Each point in the following summary sounds like it requires fine-tuning that is hardware-dependent down to the specific model, except maybe the second-to-last point, about which approach works best in general:
Our key findings are the following:
• Effective parallel sorting algorithms must use the faster access on-chip memory as much and as often as possible as a substitute to global memory operations.
• Algorithmic improvements that used on-chip memory and made threads work more evenly seemed to be more effective than those that simply encoded sorts as primitive GPU operations.
• Communication and synchronization should be done at points specified by the hardware.
• Which GPU primitives (scan and 1-bit scatter in particular) are used makes a big difference. Some primitive implementations were simply more efficient than others, and some exhibit a greater degree of fine grained parallelism than others.
• A combination of radix sort, a bucketization scheme, and a sorting network per scalar processor seems to be the combination that achieves the best results.
• Finally, more so than any of the other points above, using on-chip memory and registers as effectively as possible is key to an effective GPU sort.
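To make the "scan and 1-bit scatter" primitive concrete, here is a minimal CUDA sketch (mine, not from the paper) of a stable one-bit split of a 256-key tile done entirely in on-chip shared memory; a real radix sort would loop this over every bit and then merge tiles:

```cuda
// Sketch of the scan + 1-bit scatter ("split") primitive from the findings
// above, one tile per block, one key per thread. Names and sizes are
// illustrative, not taken from the paper's implementation.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

constexpr int TILE = 256;

__global__ void split_by_bit(const unsigned *in, unsigned *out, int bit)
{
    __shared__ unsigned keys[TILE];
    __shared__ unsigned scan[TILE];  // scan of the "bit == 0" flags

    int t = threadIdx.x;
    keys[t] = in[blockIdx.x * TILE + t];
    unsigned flag = ((keys[t] >> bit) & 1u) ^ 1u;  // 1 if the bit is 0
    scan[t] = flag;
    __syncthreads();

    // Naive Hillis-Steele inclusive scan in shared memory. It does
    // O(n log n) work; real implementations use a work-efficient scan,
    // but this keeps the sketch short.
    for (int offset = 1; offset < TILE; offset <<= 1) {
        unsigned v = (t >= offset) ? scan[t - offset] : 0u;
        __syncthreads();
        scan[t] += v;
        __syncthreads();
    }

    unsigned zerosBefore = scan[t] - flag;   // exclusive-scan value
    unsigned totalZeros  = scan[TILE - 1];   // count of keys with bit == 0
    // Keys with the bit clear keep their relative order at the front;
    // keys with the bit set are scattered, in order, after them.
    unsigned dst = flag ? zerosBefore : totalZeros + (t - zerosBefore);
    out[blockIdx.x * TILE + dst] = keys[t];
}

int main()
{
    unsigned h_in[TILE], h_out[TILE];
    for (int i = 0; i < TILE; ++i) h_in[i] = (unsigned)(rand() & 0xFF);

    unsigned *d_in, *d_out;
    cudaMalloc(&d_in, sizeof h_in);
    cudaMalloc(&d_out, sizeof h_out);
    cudaMemcpy(d_in, h_in, sizeof h_in, cudaMemcpyHostToDevice);
    split_by_bit<<<1, TILE>>>(d_in, d_out, 0);  // split on the lowest bit
    cudaMemcpy(h_out, d_out, sizeof h_out, cudaMemcpyDeviceToHost);

    printf("first key after split: %u (even keys come first)\n", h_out[0]);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```

Note that everything between the initial load and the final scatter touches only shared memory, which is exactly the "use on-chip memory as a substitute for global memory operations" point; only the tile load and the scattered store hit global memory.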
In my brief experiments using GPU graphics for compute (e.g. render-to-texture) vs. specialized GPU compute, I found that the latter required tweaking but was still slower.
My sense is that parallelization still isn't solved in general, so (a bit like NP "reduction") if you can't cast your problem in terms of an "embarrassingly parallelizable" (ep) case like rendering, it's not going to be very fast. Plus, the rendering pipeline has had all hell optimized out of it.
Put another way: what features could a general-purpose GPU language have that are ep, but have no equivalent in GPU graphics languages?
I think there are some trivial ones, e.g. older OpenGL ES (mobile) versions don't support render-to-float-texture, a crucial and ep feature for general compute.
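For contrast, the capability that render-to-float-texture emulates, having every parallel work item write a float result somewhere readable, is just a plain store in a general compute API. A minimal CUDA sketch (kernel name and sizes are my own, purely illustrative):

```cuda
// In a compute API there is no framebuffer/texture plumbing: each thread
// computes one float and stores it directly to a device buffer.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void compute_floats(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * in[i];  // stand-in for any per-element float work
}

int main()
{
    const int n = 1024;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = 0.5f * i;

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    compute_floats<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("out[10] = %f\n", h_out[10]);  // 25.0
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```

On a GL ES 2.0 device without float render targets, the equivalent result would have to be packed into 8-bit RGBA channels and unpacked on readback, which is precisely the kind of workaround that made graphics-based GPGPU awkward.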