Sorting with GPUs: A Survey (arxiv.org)
148 points by lainon on Sept 11, 2017 | 50 comments



Cool to see research in this field. I realized this year that for gaming, GPU and VRAM were actually more important than CPU and RAM; it probably says something about the underuse of GPUs in general computing.

There's one big blocker as far as I can tell, though: portability. When it comes to gaming, neural networks, cryptomining, etc., I always see "only Nvidia cards supported" or "works best on AMD". If we were to use the GPU in just any kind of application, we would need a hardware abstraction library that supports any kind of GPU, including Intel chips.

Is such an effort already being worked on, and at what stage of completion?



Awesome, thanks. I've seen the name mentioned several times, but I didn't know what was behind it.

It seems (unsurprisingly for a hardware abstraction) that performance is a problem, but at least it's a problem that is being worked on. Maybe at some point we will stop thinking "we can't implement that, it's way too massive to perform properly" and start thinking "this is a job for the GPU" :) At the very least, databases come to mind, with their sorting/filtering of massive data.


The problem with OpenCL isn't performance per se, but performance portability (well, it's only a problem for those that need such a thing, of course - many people don't). When you write OpenCL code and tweak it for one CPU or GPU, it might run at 1/10th the speed on another. This is of course something you don't have with an API that works only on GPUs from one vendor, although even there, different generations of hardware might prefer different parameters or tradeoffs.

Now you can write OpenCL kernels that automatically tweak themselves to run as fast as possible on different hardware, but that requires significant extra work over just getting it to work at all.

And finally, CUDA has a bunch of hand-tweaked libraries for common numerical operations (matrix multiply, FFT, ...) that are (partly) written in NVIDIA GPU assembly (PTX), so those operations will be faster with CUDA than with OpenCL.

CUDA is also (a bit) easier to write and use than OpenCL, and the tooling is better, so that's another reason people often default to CUDA.
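
As a rough illustration of that convenience (a minimal sketch using the Thrust library that ships with the CUDA toolkit - not a tuned benchmark), a GPU sort needs no hand-written kernel at all:

    // Sketch (my illustration): a GPU sort via Thrust, which ships with the CUDA
    // toolkit. Compile with nvcc; thrust::sort on primitive keys typically
    // dispatches to a vendor-tuned radix sort on the device.
    #include <thrust/copy.h>
    #include <thrust/device_vector.h>
    #include <thrust/host_vector.h>
    #include <thrust/sort.h>
    #include <cstdlib>

    int main() {
        thrust::host_vector<int> h(1 << 22);             // ~4M random keys on the host
        for (size_t i = 0; i < h.size(); ++i) h[i] = std::rand();

        thrust::device_vector<int> d = h;                // copy keys to GPU memory
        thrust::sort(d.begin(), d.end());                // sort runs on the device
        thrust::copy(d.begin(), d.end(), h.begin());     // copy sorted keys back
        return 0;
    }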


The LIFT project (http://www.lift-project.org/) is specifically trying to solve the problem of performance portability. Our approach relies on a high-level model of computation (think of something like a functional, or pattern-based, programming language) coupled with a rewrite-based compiler that explores the space of OpenCL programs with which to implement a computation.

We get really quite good results over a number of benchmarks - check out our papers!


How does it compare to SYCL, which someone else mentioned in another comment?

It sounds like it's trying to do a similar thing.


That's really cool, thanks for mentioning it.


> This is of course something you don't have with an API that works only on GPUs from one vendor, although even there different generations of hardware might prefer different parameters or tradeoffs.

The paper seems to confirm your last caveat. Each point in the following summary sounds like it requires fine-tuning that is hardware-dependent down to the specific model, except maybe the second-to-last point about which approach works best in general (a quick CUDA sketch of the on-chip-memory point follows the quoted list):

Our key findings are the following:

Effective parallel sorting algorithms must use the faster access on-chip memory as much and as often as possible as a substitute to global memory operations.

Algorithmic improvements that used on-chip memory and made threads work more evenly seemed to be more effective than those that simply encoded sorts as primitive GPU operations.

Communication and synchronization should be done at points specified by the hardware.

Which GPU primitives (scan and 1-bit scatter in particular) are used makes a big difference. Some primitive implementations were simply more efficient than others, and some exhibit a greater degree of fine grained parallelism than others.

A combination of radix sort, a bucketization scheme, and a sorting network per scalar processor seems to be the combination that achieves the best results.

Finally, more so than any of the other points above, using on-chip memory and registers as effectively as possible is key to an effective GPU sort.
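
To make that on-chip-memory finding concrete, here's a minimal, generic CUDA sketch (my own illustration, not code from the paper) of a block-local bitonic sorting network staged entirely in shared memory; global memory is touched only to load and store the block:

    #include <cfloat>

    // One block sorts up to blockDim.x floats (blockDim.x must be a power of two).
    // Every compare-exchange happens in fast on-chip shared memory.
    __global__ void bitonic_sort_block(float* data, int n) {
        extern __shared__ float s[];
        unsigned tid = threadIdx.x;
        s[tid] = (tid < (unsigned)n) ? data[tid] : FLT_MAX;   // pad with +inf sentinels
        __syncthreads();

        for (unsigned k = 2; k <= blockDim.x; k <<= 1) {      // bitonic sequence size
            for (unsigned j = k >> 1; j > 0; j >>= 1) {       // compare-exchange distance
                unsigned partner = tid ^ j;
                if (partner > tid) {
                    bool ascending = (tid & k) == 0;
                    if ((s[tid] > s[partner]) == ascending) { // out of order: swap
                        float t = s[tid]; s[tid] = s[partner]; s[partner] = t;
                    }
                }
                __syncthreads();                              // block-wide barrier per phase
            }
        }
        if (tid < (unsigned)n) data[tid] = s[tid];
    }

    // Example launch for up to 1024 values, shared memory sized at launch time:
    //   bitonic_sort_block<<<1, 1024, 1024 * sizeof(float)>>>(d_data, n);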


In my brief play with using GPU graphics for compute (e.g. render to texture) vs. specialized GPU compute, I found that the latter required tweaking but was still slower.

My sense is that parallelization still isn't solved in general, so (a bit like NP "reduction") if you can't cast your problem in terms of an "embarrassingly parallelizable" (EP) case like rendering, it's not going to be very fast. Plus, the rendering pipeline has had all hell optimized out of it.

Put another way: what features could a general GPU language have that are EP, but with no equivalent available in GPU graphics languages?

I think there are some trivial ones, e.g. older OpenGL ES (mobile) doesn't have render-to-float-texture - a crucial and EP feature for general compute.


One big reason for developers favouring CUDA is that since the early days it has supported C++, Fortran and any other language with a PTX backend, whereas Khronos wanted everyone to just shut up and use C99.

Finally they understood that the world has moved on and that better support for other languages had to be provided, so let's see how much OpenCL 2.2 and SPIR can improve the situation.


In academia it's also because NVidia does a lot of stuff to make your life easy.

For example, NVidia came to our university in the UK and provided training at £20 per academic/PhD student for a 2-day course on how to use CUDA, with performance tips, hands-on porting of code, etc. They also give away CUDA cards to academics under a hardware grant scheme, so it's possible to get a free Titan Xp this year for a research group.

There's not really an equivalent for AMD or Intel; a Xeon Phi Knights Landing chip is significantly more expensive than a consumer-level GPU, costs the same as a workstation GPU, and is a lot harder to get good performance from. It also doesn't seem like AMD are targeting this market, at least not currently.


The main problem is that NVidia is screwing things up. NVidia only supports OpenCL 1.1, which means that if you want to use C++ / SPIR, you are pretty much locked to AMD / Intel (Intel CPUs have an OpenCL -> AVX layer, so worst case you can always turn OpenCL code into native CPU code).

NVidia of course owns CUDA, which means they want those "premium features" locked to CUDA-only.

--------

AMD's laptop offerings have some intriguing features for OpenCL as well. Since their APUs have a CPU and a GPU on the same die, data transfer between CPU and GPU on an AMD APU (e.g. an A10 laptop chip) is absurdly fast. Like, they share L2 cache IIRC, so the data doesn't even hit main memory or even leave the chip.

But there's basically no point optimizing for that architecture, as far as I can tell anyway.


Another example would be with Vulkan, I guess.

NVidia has lots of models with DX 12 level support that don't have Vulkan drivers.


Also there's work towards a pure C++ abstraction layer for programming these accelerators, called SYCL [0]. It lets you define your compute kernels using normal C++ lambdas or functions, and then automatically infers the data dependencies to do the kernel execution. In particular, it provides an implementation of the C++ "Parallel STL" [1] (some intro slides at [2]), which in turn provides "execution policies" for various standard library functions, such as sorting. The aim is precisely the kind of thing you're talking about!

[0] I found a nice intro at https://blog.tartanllama.xyz/sycl/

[1] http://en.cppreference.com/w/cpp/experimental/parallelism

[2] https://www.khronos.org/assets/uploads/developers/library/20...
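
For a feel of the interface, here's a minimal sketch using the standard C++17 Parallel STL; with a SYCL-based implementation the execution policy would route the sort to an accelerator, while the std::execution policies shown here run on host threads:

    #include <algorithm>
    #include <execution>
    #include <random>
    #include <vector>

    int main() {
        std::vector<float> data(1 << 20);
        std::mt19937 rng(42);
        std::uniform_real_distribution<float> dist(0.0f, 1.0f);
        for (auto& x : data) x = dist(rng);

        // The execution policy is the only change versus a plain std::sort call;
        // the implementation is free to parallelize (or, for SYCL, offload) it.
        std::sort(std::execution::par, data.begin(), data.end());
        return 0;
    }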


Interesting article about OpenCL and Vulkan... https://www.pcper.com/reviews/Graphics-Cards/Follow-Neil-Tre...


> for gaming GPU and VRAM were actually more important than CPU and RAM

It's probably just the games I play, but these days I'm rarely GPU bound, just CPU bound. It would be great to see some frameworks make GPU accelerated physics (with automatic CPU fallback) easier for smaller game development companies to take advantage of.

For example: Factorio and Space Engineers are primarily CPU bound due to the tracking of millions of objects (Factorio) & physics (Space Engineers).


It must be the games you're playing - large-scale strategy? Everything in my gaming rig is from 2008 (LGA1366). The only thing I've upgraded since then is video cards. I ran an i7-920 Nehalem chip up until last year, when I "upgraded" to a Xeon X5675 from eBay as a drop-in CPU replacement. It runs everything I throw at it completely maxed out at 2560x1600, vsynced to the monitor's 60 Hz refresh rate.


Nvidia PhysX?


The problem isn't that a particular library is AMD- or NVIDIA-only in terms of computation, but that it is NVIDIA- or AMD-only in terms of performance. GPU implementations are usually optimised for a specific architecture, which then becomes the "supported" architecture.

On occasion, such implementations do use vendor-specific tools (such as CUDA), but there is a plethora of tools such as OpenCL, SYCL, etc. that provide portability - though not always performance portability, meaning that they will still be tuned to a specific architecture.

For performance portability, the LIFT project (http://www.lift-project.org/) provides a partial solution. Our approach relies on a high-level model of computation (think of something like a functional, or pattern-based, programming language) coupled with a rewrite-based compiler that explores the space of OpenCL programs with which to implement a computation.

That lets us "optimise" a given computation for a specific architecture, entirely automatically, in a way that many other low-level approaches simply aren't able to, as they contain too many implementation (rather than computation) details.


Looks interesting conceptually, but it would be good to refute the "sufficiently clever compiler" objection in anticipation. My beef with this sort of thing, or with Futhark or OpenACC, is that they don't seem to handle the complex data layouts required for real problems.


Define "complex data layouts"? One reason that tools like Futhark often don't is that complex data structures don't provide good performance on GPUs.

If, however, you mean complicated compositions of arrays, then that is something we support, as well as efficient ways for describing (e.g.) coalesced accesses or stencil operations.


Thanks for the reply. I should not have lumped Futhark in with OpenACC.

The layouts I have in mind are ring buffers of arrays and N-D arrays.

I think Futhark would work well actually, but I simply haven't had the time to get accustomed to it.


AMD is taking a stab at this with ROCm/HIP/HCC to try and get a slice of the machine learning market: https://rocm.github.io/languages.html

I wish them all the best, but I haven't seen much adoption yet. It's not particularly clear that this is the right path, because folks are building special-purpose neural-net accelerators, and those may not have the same programming model, which may make this whole HIP thing irrelevant for ML.

I'm also not totally convinced software developers are ready to take advantage of something like this; developers are barely taking advantage of multiple CPU cores, let alone the more limited GPGPU environment.


OpenACC


Sorting is a fundamental problem, so this is important stuff, but right now I can't come up with a practical problem involving large sorting operations. Can anyone come up with a practical application for this? I have no doubt they exist.

I would think the break-even point for using the GPU (assuming inputs and results are on the CPU) is several megabytes, or millions of elements, at least.

Writing this paper must have been a lot of effort. There are something like 50 different methods reviewed here. Good thing that papers like this exist.


It has potential in the field of data linking: Graph and database partitioning, merging, indexing, sorted-neighbourhood algorithms. And cases where you can't use hash-joins or comparison techniques because you're using the available ram and resources to hold something else.

Optimisation for cache-locality and branch prediction.

Any kind of repeatable scientific/datascience analysis operating across groups or cross-tabulation. Anything involving lag-like functions and relationships, time-series, relative positions (in households, neighbourhoods, countries, spatial relationships).

Perhaps I'm biased; I think I've grown up working with operations where having the data in an implicit order makes things a fair bit easier/more efficient.

That said, you're also right: it probably has to get up into the millions before you fundamentally care. But I do like to think I work on practical applications :P


I think the most common practical problem for sorting large amounts of data is to enable efficient search operations later, for example binary search over sorted data.

I also agree that needing to sort large amounts of data fast is actually not that common a requirement.

EDIT: I would add that the speed of sorting on CPU is often underestimated. See my blog post about fast multithreaded radix sort on CPU here: http://www.forwardscattering.org/post/34
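
For a rough idea of why (a generic, single-threaded LSD radix sort sketch for illustration only - not the multithreaded implementation from the post):

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Sorts 32-bit keys in four passes over 8-bit digits; each pass is a stable
    // counting sort, so the whole thing is O(n) with very predictable memory access.
    void radix_sort_u32(std::vector<uint32_t>& v) {
        std::vector<uint32_t> buf(v.size());
        for (int shift = 0; shift < 32; shift += 8) {
            std::array<size_t, 256> count{};                          // digit histogram
            for (uint32_t x : v) ++count[(x >> shift) & 0xFFu];
            size_t sum = 0;                                           // exclusive prefix sum
            for (auto& c : count) { size_t t = c; c = sum; sum += t; }
            for (uint32_t x : v) buf[count[(x >> shift) & 0xFFu]++] = x;  // stable scatter
            v.swap(buf);                                              // output becomes input
        }
    }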


Two practical applications off the top of my head:

* Do you want to build an index (for big data?) so that you can access your data fast (think of a database)? Index construction is likely to do some sorting internally.

* MapReduce has several phases, and one of them is sorting... so yes, when you are processing big data with MapReduce, the data is being sorted at some point.


Indexes don't really require data to be sorted; sorting is a consequence of adding data to many different types of index.

https://en.wikipedia.org/wiki/B-tree


Rendering transparent models in real-time (ie. for games) requires a sorting step when they overlap. I can imagine a scene containing a couple hundred thousand transparent models that need sorting.


That's obviously a good use case, and it doesn't even need that large a number of elements, because the data could be consumed by the GPU, so no back-and-forth transfer is necessary.

This paper focused on comparison-based sorting. Depth sorting can be done with GPU radix sort (which is super fast), because with minor modifications, floating-point and integer comparison agree for finite, non-NaN values (and games don't care about those).
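
The "minor modification" is the standard bit trick of mapping IEEE-754 floats to unsigned keys whose integer order matches the float order; a sketch (assuming finite, non-NaN depth values):

    #include <cstdint>
    #include <cstring>

    // After this transform, sorting the unsigned keys with a radix sort orders the
    // original floats correctly (for finite, non-NaN values).
    static inline uint32_t float_to_radix_key(float f) {
        uint32_t bits;
        std::memcpy(&bits, &f, sizeof bits);               // bit-exact reinterpretation
        // Negative floats: flip every bit (their raw bit order is reversed).
        // Non-negative floats: flip only the sign bit (placing them above negatives).
        uint32_t mask = (bits & 0x80000000u) ? 0xFFFFFFFFu : 0x80000000u;
        return bits ^ mask;
    }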


> Rendering transparent models in real-time (ie. for games) requires a sorting step when they overlap.

My game heavily uses a "find nearest n" operation on an R-Tree, then sorts by distance for AI operations.


Are there any game engines that actually do the sorting? I thought they just render them in random order and hope for the best.


I was very surprised that depth sorting for real time rendering was not listed among the potential applications in the abstract.


Isn't that usually done with the depth buffer?


Depth buffers only work well for opaque objects, which cover each other completely. With partial coverage or blending, multiple objects could end up contributing to a pixel in an order-dependent way (e.g., if you're viewing an anti-aliased leaf edge under the surface of water through a window). The depth buffer, which only stores a single depth, doesn't solve this problem directly -- you can use it to render each transparent layer one-by-one ("depth peeling"), but this can be slow.


The depth buffer has an issue of scale: one thing is 1 meter away, another is 1 million meters away, and yet another is several thousand billion meters beyond that. That makes it practically unusable for global positioning (especially for things that are all far away, where one is just a little bit further than another).

Sorting, on the other hand, doesn't need a global axis - in the ideal case it just needs to compare two elements against each other.


For non-transparent models, this is handled to a degree by the depth buffer. For transparent models, the depth buffer is not helpful.

But even when using the depth buffer, drawing in front-to-back order will drastically improve performance, since fragment shaders don't have to run for occluded fragments that get depth-culled.


Seismic exploration data needs to be sorted into various domains; for example, by offset, by receiver, or by shot. It is one of the faster operations but you can still end up CPU bound rather than I/O bound, especially if you're doing work on the fly in a GUI.

http://wiki.seg.org/wiki/CMP_sorting


I work for a startup that specializes in data matching (usually called reconciliation in the banking/finance industry). Some core matching algorithms rely on comparing sorted input using sliding windows; others rely on bucketing of some kind; anything to avoid O(m*n) when comparing two lists.

The lists involved can easily run into the millions, depending on the domain.
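
For a flavour of the sorted-input idea (a hypothetical sketch with made-up names, not our actual matching code): once both lists are sorted by a numeric key, candidate pairs within a tolerance fall out of a single forward pass instead of an O(m*n) all-pairs comparison:

    #include <cstddef>
    #include <cstdint>
    #include <utility>
    #include <vector>

    // Returns index pairs (i, j) whose keys differ by at most `tolerance`.
    // Both key columns must be sorted ascending; runtime is O(m + n + matches).
    std::vector<std::pair<std::size_t, std::size_t>>
    window_matches(const std::vector<int64_t>& a,
                   const std::vector<int64_t>& b,
                   int64_t tolerance) {
        std::vector<std::pair<std::size_t, std::size_t>> out;
        std::size_t lo = 0;                                  // left edge of b's window
        for (std::size_t i = 0; i < a.size(); ++i) {
            while (lo < b.size() && b[lo] < a[i] - tolerance) ++lo;   // slide forward
            for (std::size_t j = lo; j < b.size() && b[j] <= a[i] + tolerance; ++j)
                out.emplace_back(i, j);                      // candidate for full comparison
        }
        return out;
    }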


I have yet to see a non-trivial simulation model that doesn't need sorting at some point in its operations. But I do think you're right that sorting isn't something that would be a bottleneck in today's mainstream GPU applications - games, compression/decompression (mostly video) and graphics operations.


Sorting has long been known as something that can be sped up with more comparison units; Batcher's sorting networks, for example, are far from new. So, to that end, being able to sort a collection in polylogarithmic time is quite appealing.

What I take this paper as doing is seeing how close we are to being able to take it for granted that you can sort large collections in roughly lg N time instead of N lg N time. My guess is we are a ways from there, but probably not as far as I think.


XGBoost recently added GPU support for training. As far as I'm aware, one reason the speedup is relatively small (versus, say, deep learning on CPU vs. GPU) is that it spends a significant amount of time sorting, which is not particularly fast on the GPU.


Sorting also forms the bottleneck in several memory hard Proof-of-Work schemes. These differ from the general problem in two important ways:

1) the input data is generated from cryptographic hash functions, and can thus be considered random, leading to e.g. very even distribution of elements over buckets.

2) each round of sorting serves to identify groups of items that match on a subrange of their bits, and somehow transform these items.

There is a lot of cryptocurrency money to be made by developing the most optimized mining software and charging 1% or so "developer-fee" on the mining proceeds.


> I would think the breakeven point for using the GPU (assuming inputs and results are on the CPU) is several megabytes of millions or elements at least.

Well, the idea of GPU computing is to keep the data on the GPU as much as possible.

The CPU <---> GPU link is, at best, 16x PCIe. That is fast, but nowhere near the speed of HBM, GDDR5 or DDR4 RAM. (Exceptions: AMD's A10 APUs, where the GPU and CPU share on-chip cache memory, and IIRC the "Crystalwell" Intel chips. But these are rare chips that I doubt most people use.)

So if you happen to be running a major algorithm on the GPU, you'll probably want to sort there as well before bringing the data back to the CPU. Under no circumstances should you be introducing a PCIe round trip for a simple operation like sorting.


I'm using GPU.js for an n-body gravity simulation in three.js. So far I'm liking GPU.js, but it has its limitations: only being able to return a single float from any function causes a lot of redundancy. For example, in order to get the 3D acceleration vector for one body, I have to call the function once for each dimension, recomputing all of the temporary variables each pass. Then I have to make another pass for collision detection, so it ends up being about 4n^2 vs. just n^2.

Nonetheless, it's still about an order of magnitude faster than the pure CPU implementation I have.

In the future they plan to add WebGL 2.0 and OpenCL support, which should improve the flexibility of the library.

http://thedagda.co:9000/?stars=true&bodyCount=1000

You can toggle CPU n-body computation with:

?CPU=true in the URL; remove it for GPU computation.

You can toggle the number of planets with the bodyCount variable; the default is 1000.

If you have a beefy computer, try bodyCount=4000.

gamepad=true works if you have an Xbox controller hooked up.

On GitHub:

https://github.com/ubernaut/spaceSim


Why not calculate the unchanging intermediate results in separate functions so that you can pass them as arguments to the final function?


Easier said than done. I can't get superkernels to work at all. I've tried.


Sorting seems like a memory-hard problem where the computationally optimal merge sort solution is heavily branched - two things GPUs have never been terribly good at. Furthermore, the synchronized program counter per warp of threads presents a considerable roadblock to getting optimal thread occupancy, specifically in the CUDA architecture.


The table at the end would benefit from a column indicating the availability of the source code.



