CUDA has had asynchronous command lists (multiple, actually, via streams) forever. What does AMD’s special hardware enable? Is it exposed in ROCm somehow?
Edit: Maybe I should be more precise with my language. Using CUDA streams, you can queue up multiple kernels (basically commands?) and have them execute in parallel. Kernels invoked on the same stream are serialized against each other; kernels on different streams may run concurrently. I have no idea what the exact analog is in the compute shader world.
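To make the ordering rules concrete, here's a minimal sketch of the stream semantics I mean (kernel name and sizes are illustrative, not from anything above; needs a CUDA-capable GPU to actually run):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Same stream: these two launches are serialized with each other.
    scale<<<n / 256, 256, 0, s0>>>(a, 2.0f, n);
    scale<<<n / 256, 256, 0, s0>>>(a, 0.5f, n);

    // Different stream: this launch *may* overlap with the work on s0,
    // if the hardware has spare SMs/resources for both. The API only
    // promises it won't be ordered against s0; overlap is not guaranteed.
    scale<<<n / 256, 256, 0, s1>>>(b, 3.0f, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```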
"Asynchronous shaders" doesn't mean that host code is unblocked on job dispatch (that's been the case for nearly all GPU APIs for going on 40 years). It's having multiple hardware queues with different priorities to be able to have other work to do when a kernel/shader is only using a subset of the SMs/CUs. So it's a hardware balancing feature for non heterogeneous workloads.
Nvidia hardware won't dispatch multiple kernels simultaneously; all of the streams stuff you're seeing is hints to the user-space driver about how it orders the work in its single hardware command queue.
I thought streams allowed kernels to run in parallel if both kernels fit within the available resources. There's no priority system, and I'm guessing the hardware isn't smart enough to reorder kernels to pack them together better. Further Googling to confirm that Nvidia means both parallel AND concurrent didn't turn up much, but they do appear to mean both.
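One data point on the priority question: the CUDA runtime does expose relative per-stream priorities via `cudaStreamCreateWithPriority` (available since CUDA 5.5). Whether a priority maps to a distinct hardware queue or is just a scheduling hint is an architecture/driver detail I can't vouch for. A minimal sketch (needs a CUDA-capable GPU):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int lo, hi;
    // Query the legal priority range for this device
    // (numerically lower values mean higher priority).
    cudaDeviceGetStreamPriorityRange(&lo, &hi);
    printf("priority range: %d (lowest) .. %d (highest)\n", lo, hi);

    cudaStream_t high, low;
    cudaStreamCreateWithPriority(&high, cudaStreamNonBlocking, hi);
    cudaStreamCreateWithPriority(&low, cudaStreamNonBlocking, lo);

    // Pending work on `high` is preferred by the scheduler when both
    // streams have blocks ready to dispatch; it's a scheduling hint,
    // not preemption of already-running blocks.

    cudaStreamDestroy(high);
    cudaStreamDestroy(low);
    return 0;
}
```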
Ahh ok. I’ve had a similar conversation with a graphics guy that strongly paralleled this one, and I was confused then, too. Seemingly the asynchronous-compute issues are a limitation only when it comes to shaders, I guess? I write some CUDA but no shaders, and certainly not enough that I’d make super concrete claims about more esoteric features like streams.