A shame that AVX512 only has pshufb (aka: permute), and is missing the GPU-instruction "bpermute", aka backwards permute.
pshufb is effectively a "gather" instruction over an AVX register. Equivalent to GPU permutes.
bpermute, in GPU land, is a "scatter" instruction over a vector register. There's no CPU / AVX equivalent of it. But I keep coming up with good uses of the bpermute instruction (much like pshufb is crazy flexible, its inverse, the backwards permute, is also crazy flexible).
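To make the gather / scatter distinction concrete, here is a minimal scalar sketch in Python. It is a reference model only, not any real instruction: the names `gather_lanes`, `scatter_lanes` and `invert_indices` are made up for illustration, and it ignores pshufb details such as the high-bit zeroing and the 128-bit lane split. It also shows the usual workaround when only a gather is available: invert the index vector first, which costs extra work.

```python
def gather_lanes(src, idx):
    """Gather: output lane i pulls from source lane idx[i]."""
    return [src[j] for j in idx]

def scatter_lanes(src, idx):
    """Scatter: input lane i pushes its value to output lane idx[i].

    Assumes idx is a permutation, so every output lane is written once.
    """
    dst = [0] * len(src)
    for i, j in enumerate(idx):
        dst[j] = src[i]
    return dst

def invert_indices(idx):
    """Build the inverse permutation, so a gather can emulate a scatter."""
    inv = [0] * len(idx)
    for i, j in enumerate(idx):
        inv[j] = i
    return inv

src = [10, 20, 30, 40]
idx = [2, 0, 3, 1]
gathered = gather_lanes(src, idx)    # [30, 10, 40, 20]
scattered = scatter_lanes(src, idx)  # [20, 40, 10, 30]
```

With the same index vector, scattering undoes the gather: `scatter_lanes(gather_lanes(src, idx), idx)` returns `src`. That round trip is the reason to want both directions.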
--------
Almost any code that finds itself "gathering" data across a vector register will inevitably "scatter" the data back at some point.
Much like how "pext" is the "gather" instruction for 64-bits, you need pdep to handle the equal-and-opposite case. It's incredibly silly that AVX / AVX512 has implemented only one-half of this concept (gather / pshufb / aka Permute).
I wish for the day that Intel/AMD implements (scatter / backwards-pshufb / aka Backwards-Permute).
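For anyone who hasn't used the pext / pdep pair, here is a rough bit-level model in Python. The function names just mirror the instruction mnemonics. This is a slow scalar sketch of the semantics, not how you would call the real BMI2 instructions.

```python
def pext(value, mask):
    """Gather the bits of value selected by mask into the low bits of the result."""
    result = 0
    out_bit = 0
    while mask:
        lowest = mask & -mask          # isolate the lowest set bit of mask
        if value & lowest:
            result |= 1 << out_bit
        out_bit += 1
        mask &= mask - 1               # clear that bit and move on
    return result

def pdep(value, mask):
    """Scatter the low bits of value out to the bit positions set in mask."""
    result = 0
    in_bit = 0
    while mask:
        lowest = mask & -mask
        if (value >> in_bit) & 1:
            result |= lowest
        in_bit += 1
        mask &= mask - 1
    return result

packed = pext(0b10110010, 0b11110000)   # 0b1011
spread = pdep(packed, 0b11110000)       # 0b10110000
```

With the same mask, pdep puts the bits back where pext found them, so `pdep(pext(v, m), m) == v & m`. The byte-level wish is the same pairing: a scatter that puts lanes back where the gather took them from.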
-------
Fortunately, I've got a Vega64 and NVidia graphics cards with both permute and bpermute instructions for high-speed shuffling of data. But CPU-space should benefit from this concept too.
OK that's cool, didn't know about bpermute. Made sense there should be a counterpart. Well when you only have pshufb, it works OK, yeah there's tons of gaps but if you're clever and...and if you compromise speed...thanks for telling me about bpermute!