Thanks for the enlightenment here; some of those I didn't know. I was used to explicit vectors, coming from AVX-512 proficiency and pdep/pext and GFNI rabbit holes, but I more or less set all of that aside when moving to CUDA. Trying to avoid the trap of 'doing C in Rust' (or Ada; hopefully the meaning is clear), I jumped to new idioms. I must say most of the ones you cited never appeared in the high-performance code I've read, and some I only saw while perusing PTX and lower-level compiled code, which... I was never sure nvidia would maintain over time. It seems that for us it's CUB, barriers, atomics and ballots.
It seems I have lots of reading to do and lots of ways to improve my sorting networks / counting sort implementations.
> I was never sure nvidia would maintain over time.
PTX is maintained over time. It's a high-level assembly, so to speak: the full details of the machine remain abstracted so that code can be more portable.
SASS is not. SASS changes from architecture to architecture. SASS is the actual machine code of NVidia cards. There's an overall understanding of SASS in the GPU world, but it's not really documented and you "shouldn't" want to learn about it.
--------
I should note that Intel's "pshufb" instruction is very similar to the permute instruction in NVidia/AMD. So yeah, there's a high-speed generic shuffle that's key to Intel/AMD AVX512 code.
But having the backwards-direction (bpermute) available too, as well as __shared__ memory for all other cases is great.
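To make the two directions concrete, here is a minimal scalar sketch in C++ of what a forward permute (push) and a backward permute (pull) do across lanes. It is a model of the semantics only, not of any real instruction: the names `permute`, `bpermute`, `Lanes` and `kLanes` are hypothetical, the 8-lane width is chosen for readability, and the "last writer wins" behaviour on colliding push targets is an assumption of this model. As far as I can tell, pshufb and CUDA's `__shfl_sync` both index their source per destination element, which is the pull shape.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Illustrative width only; real warps/wavefronts are 32 or 64 lanes wide.
constexpr std::size_t kLanes = 8;
using Lanes = std::array<std::uint32_t, kLanes>;

// Pull / gather ("backward" permute): lane i READS the value held by lane idx[i].
// Indices are wrapped modulo kLanes so the model never reads out of bounds.
Lanes bpermute(const Lanes& src, const Lanes& idx) {
    Lanes dst{};
    for (std::size_t i = 0; i < kLanes; ++i) {
        dst[i] = src[idx[i] % kLanes];
    }
    return dst;
}

// Push / scatter ("forward" permute): lane i WRITES its value into lane idx[i].
// Lanes that nobody writes to stay 0. If two lanes target the same destination,
// the higher-numbered lane wins here (an assumption of this model).
Lanes permute(const Lanes& src, const Lanes& idx) {
    Lanes dst{};
    for (std::size_t i = 0; i < kLanes; ++i) {
        dst[idx[i] % kLanes] = src[i];
    }
    return dst;
}
```

When `idx` is a true permutation (every lane appears exactly once), the two are inverses of each other: pushing with `idx` and then pulling with the same `idx` gives back the original lanes. That is why having both directions is handy, since you don't have to invert the index vector yourself.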
Thanks again.