
By async compute, do you mean something different than what CUDA streams expose?

AMD’s compute cards have generally been worse than Nvidia’s as far as I know.



I mean actual hardware capabilities. AMD GPUs were ahead in that.

See: http://developer.amd.com/wordpress/media/2012/10/Asynchronou...


This only seems to be applicable to shaders? GPGPU in most contexts refers to code divorced from any graphics pipeline, like CUDA or ROCm. CUDA has had asynchronous and concurrent kernel calls for a long time. How are asynchronous compute shaders relevant in that context?


Compute shaders are divorced from the graphics pipeline as well. As far as the hardware is concerned, CUDA/ROCm and compute shaders are the same thing.


Compute shaders and CUDA have very different execution and driver models. Just take a look at the way memory is managed in each.


Can you be specific? Everything I've seen looks like a user space host driver difference.

Also, in the case of "asynchronous shaders", it's a command list dispatching feature, which is equally applicable to CUDA/ROCm and compute shaders.


Have you seen CUDA? It's C++, where half of it runs on the CPU and half of it runs on the GPU. Memory transfers are automatically scheduled for you! No, obviously the hardware doesn't magically transform when you run a CUDA program, but there are a lot of features in the hardware that just aren't exercised by compute shaders, and the user-space host driver is very different.
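
For a flavor of that, here's a minimal sketch (nothing vendor-doc specific, just the programming model; with managed memory the transfers really are scheduled for you):

    #include <cstdio>

    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // device code
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));        // visible to CPU and GPU
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }   // host code

        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);  // launch on the GPU
        cudaDeviceSynchronize();                         // wait for the kernel

        printf("y[0] = %f\n", y[0]);                     // expect 4.0
        cudaFree(x);
        cudaFree(y);
        return 0;
    }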

> Also, in the case of "asynchronous shaders", it's a command list dispatching feature, which is equally applicable to CUDA/ROCm and compute shaders.

Async compute affects how you have to synchronize your code, and what your hazards are. If you have a queue of items, you know when one starts that the other is done. With async compute, you can give up that guarantee in exchange for overlapping work. NVIDIA still doesn't have this in modern GPUs.
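
To make the hazard concrete in CUDA terms, a rough sketch (the produce/consume kernels are just stand-ins): once work is split across streams you give up the implicit ordering and have to reintroduce it with events exactly where a real dependency exists.

    #include <cstdio>

    __global__ void produce(int *buf) { buf[0] = 41; }
    __global__ void consume(const int *buf, int *out) { out[0] = buf[0] + 1; }

    int main() {
        int *buf, *out;
        cudaMalloc(&buf, sizeof(int));
        cudaMallocManaged(&out, sizeof(int));

        cudaStream_t sA, sB;
        cudaStreamCreate(&sA);
        cudaStreamCreate(&sB);

        cudaEvent_t produced;
        cudaEventCreateWithFlags(&produced, cudaEventDisableTiming);

        produce<<<1, 1, 0, sA>>>(buf);          // stream A writes buf
        cudaEventRecord(produced, sA);          // mark the point stream B must wait for

        cudaStreamWaitEvent(sB, produced, 0);   // reintroduce the ordering we gave up
        consume<<<1, 1, 0, sB>>>(buf, out);     // work not behind the event may overlap

        cudaDeviceSynchronize();
        printf("out = %d\n", out[0]);           // expect 42
        cudaFree(buf);
        cudaFree(out);
        return 0;
    }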


You have async compute on NVIDIA GPUs.

See this slide deck about it: https://developer.download.nvidia.com/CUDA/training/StreamsA...

(that's the Fermi-era implementation; it has become even more flexible since then)


CUDA has had asynchronous command lists (multiple, actually, via streams) forever. What does AMD’s special hardware enable? Is it exposed in ROCm somehow?

Edit: Maybe I should be more precise with my language. Using CUDA streams, you can queue up and execute multiple kernels (basically commands?) in parallel. Kernels invoked on the same stream are serialized. I have no idea what the exact analog is in the compute shader world.
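
Roughly, in code (a throwaway sketch; the spin kernel is only there so any overlap, or lack of it, shows up in a profiler):

    __global__ void spin(int iters) {
        // burn time so overlap is observable in Nsight/nvprof
        for (volatile int i = 0; i < iters; ++i) { }
    }

    int main() {
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        spin<<<1, 32, 0, s1>>>(1 << 22);   // same stream:
        spin<<<1, 32, 0, s1>>>(1 << 22);   // these two are serialized

        spin<<<1, 32, 0, s2>>>(1 << 22);   // different stream: free to overlap
                                           // with the s1 work if resources allow

        cudaDeviceSynchronize();
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
        return 0;
    }

Whether the s2 launch actually overlaps with the s1 work is up to the hardware and how much of the GPU each kernel occupies.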


"Asynchronous shaders" doesn't mean that host code is unblocked on job dispatch (that's been the case for nearly all GPU APIs for going on 40 years). It's having multiple hardware queues with different priorities to be able to have other work to do when a kernel/shader is only using a subset of the SMs/CUs. So it's a hardware balancing feature for non heterogeneous workloads.

Nvidia hardware won't dispatch multiple kernels simultaneously; all of the streams stuff you're seeing amounts to hints to the user-space driver about how it orders the work in its single hardware command queue.


Is that not what this is? https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....

I thought this allowed kernels to run in parallel if both kernels fit within the available resources. It doesn't have a priority system, and I'm guessing it isn't smart enough to reorder kernels to fit together better. Further Googling to be sure that they mean parallel AND concurrent didn't turn up much, but Nvidia does appear to mean both.
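
For what it's worth, one way to see what a given card reports (a quick sketch; concurrentKernels is the flag for running kernels from different streams at the same time, asyncEngineCount covers overlapping copies with execution):

    #include <cstdio>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        printf("%s: concurrentKernels=%d asyncEngineCount=%d\n",
               prop.name, prop.concurrentKernels, prop.asyncEngineCount);
        return 0;
    }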


cudaStreamCreateWithPriority is what you'd use to create a stream with a specifically defined priority.

Modern NVIDIA GPUs can run up to 128 kernels concurrently.
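
A minimal sketch of what that looks like, assuming the device supports stream priorities (the valid range is queried at runtime):

    #include <cstdio>

    __global__ void work() { }

    int main() {
        int lo, hi;
        cudaDeviceGetStreamPriorityRange(&lo, &hi);   // note: hi is numerically smaller

        cudaStream_t background, urgent;
        cudaStreamCreateWithPriority(&background, cudaStreamNonBlocking, lo);
        cudaStreamCreateWithPriority(&urgent,     cudaStreamNonBlocking, hi);

        work<<<1, 32, 0, background>>>();   // bulk work
        work<<<1, 32, 0, urgent>>>();       // latency-sensitive work gets scheduling preference

        cudaDeviceSynchronize();
        printf("priority range: low=%d high=%d\n", lo, hi);
        return 0;
    }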


Ahh ok. I've had a similar conversation with a graphics guy that strongly paralleled this one, and I was confused then, too. Seemingly the asynchronous compute issues are a limitation only when it comes to shaders, I guess? I write some CUDA but no shaders, and certainly not so much that I'd make super concrete claims about more esoteric features like streams.


Were. AMD totally squandered the advantage they had from launching GCN early through an absolutely awful software stack.


RDNA 2 only improved on the above. Nvidia was and remains a dead end with its CUDA lock-in, but I agree that there should be a more concerted effort to break it.


Lol? ROCm is a mess.

It's awful. And it is not even supported on RDNA and RDNA2. They even dropped Polaris support last month so that it's only supported on Vega.

And that's very much not a high quality GPU compute stack, sorry...

CUDA isn't a dead end and isn't going away any time soon.


I was talking about hardware, not about ROCm. As I said above, GPU compute could benefit from a nice programming model with a modern language. That would dislodge the CUDA lock-in for good.

CUDA isn't dead, but there should be a bigger effort to get rid of it because it's not a proper GPU programming tool but rather a market manipulation tool by Nvidia.


Nowadays, modern Nvidia hardware (Volta onwards) handles compute workloads better, notably through independent thread scheduling.

RDNA2's big cache and low main memory bandwidth didn't help either: the cache is there, but compute workloads tend to exceed what it can absorb by a lot...
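
For a sense of what independent thread scheduling buys in practice, here's a rough sketch of the kind of intra-warp pattern it makes safe: lane 0 publishes a value while the other lanes of the same warp spin on it, which could deadlock on older lockstep warps.

    #include <cstdio>

    // With independent thread scheduling (Volta+) each lane has its own
    // program counter, so the spinning lanes can make forward progress
    // even though lane 0 of the same warp hasn't reached them yet.
    __global__ void intra_warp_handoff(int *flag, int *data, int *out) {
        int lane = threadIdx.x % 32;
        if (lane == 0) {
            *data = 42;
            __threadfence();
            atomicExch(flag, 1);                 // publish
        } else {
            while (atomicAdd(flag, 0) == 0) { }  // wait for lane 0
            out[lane] = *data;
        }
        __syncwarp();                            // explicit reconvergence point
    }

    int main() {
        int *flag, *data, *out;
        cudaMallocManaged(&flag, sizeof(int));
        cudaMallocManaged(&data, sizeof(int));
        cudaMallocManaged(&out, 32 * sizeof(int));
        *flag = 0;
        intra_warp_handoff<<<1, 32>>>(flag, data, out);
        cudaDeviceSynchronize();
        printf("out[1] = %d\n", out[1]);         // expect 42 on Volta or newer
        return 0;
    }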


I doubt there is a major difference. And if anything, AMD caught up to Nvidia in other areas. CUDA isn't used because Nvidia hardware is better. If it were about hardware, Nvidia wouldn't have used lock-in.


AMD doesn't even have the performance crown, not by a long shot. NVIDIA has held it pretty much non-stop since... Maxwell 2?

As for there being no difference in compute workloads, take Blender as an example: https://techgage.com/article/blender-2-91-best-cpus-gpus-for...

(the 6900 XT is 20.6 TFLOPS FP32, the RTX 3090 is 35.6 TFLOPS FP32 _and_ also has tensor cores; AMD can't win against that sheer brute force)

And AMD's software stack removes them from consideration for a huge share of compute workloads. If they aren't ready to build a proper one, it doesn't matter how good their hardware is; it might as well be a brick.


I don't see Nvidia leading in raw compute power, either in GPGPU or in gaming. At most there will be parity going forward, which will prompt Nvidia to double down on lock-ins like CUDA.

Either way, CUDA should be gone for good and replaced with something that's not tied to a specific GPU. I welcome competition in hardware. I have no respect for lock-in shenanigans.


They’re leading right now by a bit, and when it comes to workloads involving tensor cores or ray tracing they’re burying AMD.

I wish CUDA wasn’t vendor locked, too, but AMD and Intel need to start taking GPGPU software seriously. They’ve dug themselves a massive hole at this point.


I don't see a need to whitewash Nvidia here. They are nasty lock-in proponents. That doesn't mean AMD and Intel shouldn't invest more in improving the software side of things, which I think they are doing (as the post we're discussing shows), so it should be getting better.



