CUDA has had asynchronous command lists (multiple, actually, via streams) forever. What does AMD’s special hardware enable? Is it exposed in ROCm somehow?
Edit: Maybe I should be more precise with my language. Using CUDA streams, you can queue up multiple kernels (basically commands?) and have them execute in parallel. Kernels invoked on the same stream are serialized against each other; kernels on different streams may run concurrently. I have no idea what the exact analog is in the compute shader world.
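To make the ordering rules concrete, here's a minimal sketch of the stream semantics I mean (kernel name and sizes are illustrative, not from anything above; needs a CUDA-capable GPU to actually run):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Same stream: these two launches are serialized with each other.
    scale<<<n / 256, 256, 0, s0>>>(a, 2.0f, n);
    scale<<<n / 256, 256, 0, s0>>>(a, 0.5f, n);

    // Different stream: this launch *may* overlap with the work on s0,
    // if the hardware has spare SMs/resources for both. The API only
    // promises it won't be ordered against s0; overlap is not guaranteed.
    scale<<<n / 256, 256, 0, s1>>>(b, 3.0f, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```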
"Asynchronous shaders" doesn't mean that host code is unblocked on job dispatch (that's been the case for nearly all GPU APIs for going on 40 years). It's having multiple hardware queues with different priorities to be able to have other work to do when a kernel/shader is only using a subset of the SMs/CUs. So it's a hardware balancing feature for non heterogeneous workloads.
Nvidia hardware won't dispatch multiple kernels simultaneously; all of the streams stuff you're seeing is hints to the user-space driver about how it orders the work in its single hardware command queue.
I thought streams allowed kernels to run in parallel if both kernels fit within the available resources. There's no priority system, and I'm guessing the hardware isn't smart enough to reorder kernels to pack them together better. Further Googling to confirm that Nvidia means both parallel AND concurrent didn't turn up much, but they do appear to mean both.
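One data point on the priority question: the CUDA runtime does expose relative per-stream priorities via `cudaStreamCreateWithPriority` (available since CUDA 5.5). Whether a priority maps to a distinct hardware queue or is just a scheduling hint is an architecture/driver detail I can't vouch for. A minimal sketch (needs a CUDA-capable GPU):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int lo, hi;
    // Query the legal priority range for this device
    // (numerically lower values mean higher priority).
    cudaDeviceGetStreamPriorityRange(&lo, &hi);
    printf("priority range: %d (lowest) .. %d (highest)\n", lo, hi);

    cudaStream_t high, low;
    cudaStreamCreateWithPriority(&high, cudaStreamNonBlocking, hi);
    cudaStreamCreateWithPriority(&low, cudaStreamNonBlocking, lo);

    // Pending work on `high` is preferred by the scheduler when both
    // streams have blocks ready to dispatch; it's a scheduling hint,
    // not preemption of already-running blocks.

    cudaStreamDestroy(high);
    cudaStreamDestroy(low);
    return 0;
}
```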
Ahh ok. I’ve had a similar conversation with a graphics guy that strongly paralleled this one, and I was confused then, too. Seemingly the asynchronous-compute issues are a limitation only when it comes to shaders, I guess? I write some CUDA but no shaders, and certainly not enough that I’d make super concrete claims about more esoteric features like streams.