> There are a few cross-lane shuffle/reduce instructions, but it seems to me that those would be handled in a dedicated execution unit (they are not really the fast path/common case).
Yes, you essentially need a (kind of) crossbar for shuffle and value broadcast.
But as far as I know, there is no unit dedicated to this on Nvidia GPUs.
Depending on the GPU microarchitecture, shuffle and broadcast may instead be implemented in different ways (e.g. through the load/store units).
Note that I said "crossbar" for simplicity and because there is little information available; I doubt that all the paths really exist.
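
For anyone unfamiliar with the instructions being discussed, here is a minimal CUDA sketch of a cross-lane reduce followed by a broadcast. It only shows the programmer-visible side (`__shfl_down_sync` and `__shfl_sync`) and says nothing about how the hardware routes the values. The names `warp_sum`, `demo` and `run_demo` are made up for illustration, and it assumes a full warp of 32 active lanes.

```cuda
#include <cuda_runtime.h>

// Sum one int per lane across a full 32-lane warp using shuffles only.
// Assumption: all 32 lanes are active, hence the full mask.
__device__ int warp_sum(int v) {
    // Each step adds the value held by the lane `offset` positions higher.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;  // only lane 0 is guaranteed to hold the full sum
}

__global__ void demo(int *out) {
    int lane = threadIdx.x;      // launched with a single warp: lanes 0..31
    int sum = warp_sum(lane);    // lane 0 gets 0 + 1 + ... + 31 = 496
    // Broadcast: every lane reads the value held by lane 0.
    sum = __shfl_sync(0xffffffffu, sum, 0);
    out[lane] = sum;
}

// Hypothetical host helper: runs one warp and copies the 32 results back.
bool run_demo(int *host_out) {
    int *dev = nullptr;
    if (cudaMalloc(&dev, 32 * sizeof(int)) != cudaSuccess) return false;
    demo<<<1, 32>>>(dev);
    // cudaMemcpy waits for the kernel to finish and reports its errors.
    cudaError_t err = cudaMemcpy(host_out, dev, 32 * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return err == cudaSuccess;
}
```

Calling `run_demo` with a 32-element `int` array should leave the same sum in every slot. Note that the reduction only ever uses five fixed offsets and the broadcast a single source lane, so software patterns like this do not need every lane-to-lane path.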