I think it is important to note that, even though it is double-pumped, using 512-bit registers puts less pressure on the decoder and lets the pipelines fill. So use 512-bit if you can.
It should also be noted that the idea that Zen 4 is "double-pumped" while the Intel CPUs are not is quite misleading.
On most Intel CPUs with AVX-512 support there are two classes of 512-bit instructions: those executed by fusing a pair of 256-bit execution units, so that 512-bit and 256-bit instructions have equal throughput, and those executed by fusing a pair of 256-bit units while a third 256-bit unit is also extended to 512 bits.
For the second class of instructions the Intel CPUs have a throughput of two 512-bit instructions per cycle vs. three 256-bit instructions per cycle.
Compared to the cheaper Intel models, Zen 4, while keeping the same total throughput as Zen 3 (two 512-bit instructions per cycle instead of Zen 3's four 256-bit instructions per cycle), either matches or exceeds the throughput of the Intel CPUs with AVX-512. Zen 4 also allows 1 FMA + 1 FADD per cycle, while these Intel CPUs can execute only 1 FMA per cycle.
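To make the 1 FMA + 1 FADD point concrete, here is a minimal sketch (my own, not from the article) of a loop shape that can feed both pipes at once; a real kernel would unroll with several accumulators to hide FMA latency:

```c
#include <immintrin.h>
#include <stddef.h>

// Two independent accumulator chains: one fed by FMAs, one by plain adds,
// so a core that can issue 1 FMA + 1 FADD per cycle can keep both busy.
void fma_plus_add(const double *a, const double *b, const double *c,
                  size_t n, double fma_out[8], double add_out[8]) {
    __m512d acc_fma = _mm512_setzero_pd();
    __m512d acc_add = _mm512_setzero_pd();
    for (size_t i = 0; i + 8 <= n; i += 8) {
        __m512d va = _mm512_loadu_pd(a + i);
        __m512d vb = _mm512_loadu_pd(b + i);
        __m512d vc = _mm512_loadu_pd(c + i);
        acc_fma = _mm512_fmadd_pd(va, vb, acc_fma); // FMA pipe
        acc_add = _mm512_add_pd(acc_add, vc);       // FP add pipe
    }
    _mm512_storeu_pd(fma_out, acc_fma);
    _mm512_storeu_pd(add_out, acc_add);
}
```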
The only important advantage of Intel appears in the most expensive models of the server and workstation CPUs, i.e. in most Xeon Gold, all Xeon Platinum and all of the Xeon W models that have AVX-512 support.
In these more expensive models there is a second 512-bit FMA unit, which doubles the FMA throughput compared to Zen 4. These models are also helped by a doubled throughput for loads from the L1 cache, which is matched to the FMA throughput.
So the AVX-512 implementation in Zen 4 is superior to that in the cheaper CPUs like Tiger Lake, even without taking into account the few new execution units added in Zen 4, like the 512-bit shuffle unit.
Only the Xeon Platinum and similar models of the upcoming Sapphire Rapids will have a definitely greater throughput for floating-point operations than Zen 4, but they will also have a significantly lower all-core clock frequency (due to the inferior manufacturing process), so the higher throughput per clock cycle is not certain to overcome the deficit in clock frequency.
Yes, Intel also takes a less than "full" approach to moving from 256-bit to 512-bit.
Though I think it is fair to say the Intel implementation represents an intermediate state between the AMD approach (essentially no increase in execution or datapath resources outside of the shuffle) and simply extending everything 2x, i.e. a full doubling of every resource.
Essentially, an SKX chip behaves as if it had two full-width 512-bit execution ports: p01 (via fusion) and p5. For 256-bit it has three ports. Not all ports can do everything, so the comparison is sometimes 3 vs 2 or 2 vs 1, but also sometimes 2 vs 2 (FMA operations on 2-FMA chips come to mind).
Critically, however, the load and store units were also extended to 512 bits: SKX can do 2x loads (1024 bits) and 1x store (512 bits) per cycle. Load/store throughput puts a hard cap on the performance of load- and store-heavy AVX methods, which does include some fairly trivial but important integer loops like memcpy, memset and memchr type stuff that is fast enough to hit the load or store limits.
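For illustration, a bare-bones copy loop of the kind being described (names are mine; a real memcpy does alignment, non-temporal stores, tails, etc.). The port math is exactly one 512-bit load plus one 512-bit store per 64 bytes, so the single store port is the ceiling:

```c
#include <immintrin.h>
#include <stddef.h>

void copy_zmm(void *dst, const void *src, size_t bytes) {
    char *d = (char *)dst;
    const char *s = (const char *)src;
    size_t i = 0;
    for (; i + 64 <= bytes; i += 64) {
        __m512i v = _mm512_loadu_si512((const void *)(s + i)); // 1 load port
        _mm512_storeu_si512((void *)(d + i), v);               // 1 store port
    }
    for (; i < bytes; i++)  // scalar tail
        d[i] = s[i];
}
```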
Maxing out memory bandwidth requires multiple threads because Intel cores have relatively few line fill buffers. I've seen around 12 GB/s per SKX core with AVX-512.
You usually don't even need AVX-512 to sustain enough loads/stores at the core to max out memory bandwidth "in theory": even with 256-bit loads and assuming 2 loads / 1 store per cycle (ICL/Zen 3 and newer can do more), that's 256 GB/s of read bandwidth or 128 GB/s of write bandwidth (or both, at once!) at 4 GHz.
Indeed, you can reach these numbers if you always hit in L1 and come close if you always hit in L2. The load number especially is higher than almost any single socket bandwidth until perhaps very recently*: an 8-channel chip with the fastest DDR4-3200 would get 25.6 x 8 = 204.8 GB/s max theoretical bandwidth. Most chips have fewer channels and lower max theoretical bandwidth.
However, and as a sibling comment alludes to, you generally cannot in practice sustain enough outstanding misses from a single core to actually achieve this number. E.g., with 16 outstanding misses and 100 ns latency per cache line you can only demand-fetch at ~10 GB/s from one core. Actual numbers are higher due to prefetching, which both decreases the latency (since the prefetch is initiated from a component closer to the memory controller) and makes more outstanding misses available (since there are more miss buffers at the L2 than at the core), but this only roughly doubles the bandwidth: it's hard to get more than 20-30 GB/s from a single core on Intel.
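For concreteness, the Little's-law arithmetic behind that ~10 GB/s figure, using the 16 outstanding misses, 64-byte cache lines and ~100 ns latency assumed above:

$$
\text{BW} \approx \frac{\text{outstanding misses} \times \text{line size}}{\text{latency}} = \frac{16 \times 64\ \text{B}}{100\ \text{ns}} \approx 10.2\ \text{GB/s}
$$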
This isn't a fundamental limitation which applies to every CPU however: Apple chips can extract the entire bandwidth from a single core, despite having much smaller 128-bit (perhaps 256-bit if you consider load pair) load and store instructions.
---
* Not really sure about this one: are there 16-channel DDR5 setups out there yet? (16 DDR5 channels correspond to 8 independent DIMMs, so that is similar to an 8-channel DDR4 setup, as DDR5 has 2x channels per DIMM.)
Indeed a great article, well worth reading in full for anyone who uses AVX-512.
Two other things that jumped out at me: VPCONFLICT is 10x as fast, compressstoreu is >10x slower. Those might be enough to warrant a Zen4-specific codepath in Highway.
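As a hedged sketch of what a Zen 4-specific path might look like (this is not Highway's actual API; the helper name is made up): compress into a register and do a plain store instead of vcompressstoreu:

```c
#include <immintrin.h>

// Write the lanes of v selected by mask k to dst, return the count written.
// Note: the store always writes a full 64 bytes, so dst needs 16 ints of slack.
static inline int compress_store_sketch(int *dst, __mmask16 k, __m512i v) {
    __m512i packed = _mm512_maskz_compress_epi32(k, v); // compress in-register
    _mm512_storeu_si512(dst, packed);                   // plain 512-bit store
    return _mm_popcnt_u32((unsigned)k);
}
```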
Looks like SIMD implementations that use LUTs should favor small tables that fit in registers, using `vpermi2pd` for lookups, over larger tables + gather.
With 64-bit elements you still get a LUT size of 16 (shuffle indexes into two 8x double vectors), which can be good enough for functions like log and exp.
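For instance, a 16-entry double-precision LUT held in two zmm registers can be indexed with a single vpermi2pd; a minimal sketch (function and variable names are mine):

```c
#include <immintrin.h>

// For each lane j: result[j] = (idx[j] < 8) ? lut_lo[idx[j]] : lut_hi[idx[j] - 8]
static inline __m512d lut16_lookup(__m512i idx, __m512d lut_lo, __m512d lut_hi) {
    return _mm512_permutex2var_pd(lut_lo, idx, lut_hi); // vpermi2pd
}
```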
Loading data from random memory locations has become too expensive compared to computation. For log, exp, trigonometry, and similar, I think people rarely use lookup tables anymore. Instead, they use fairly high-degree polynomial approximations, and for log/exp they abuse the IEEE binary float representation.
LUTs at least do well in microbenchmarks, but I do worry that they may do comparatively much worse in real code.
That said, that's another advantage of small tables using vpermi2pd.
The Julia/base implementations of log and exp both use LUTs.
The SIMD AVX512 implementation of exp used by LoopVectorization.jl will sometimes use the 16 element table.
I experimented with log, but had some difficulty getting accuracy and performance, so the version LoopVectorization.jl currently uses doesn't use a table.
Shuffle is SIMD's killer app. It's apparently an interesting but expensive circuit, and it's smart to prioritize it. Absolute best instruction, hands down. So yes, double-pumping isn't full speed (meaning single-cycle), but it increases compatibility with AVX-512 code. I guess a program that picks its code path at runtime from CPUID might not benefit, and of course there are all kinds of... but for pedestrian purposes, meaning everything on GitHub, it's a step forward. Hey, 40% speedup on Cinebench, that's good.
A shame that AVX512 only has pshufb (aka permute), and is missing the GPU instruction "bpermute", aka backwards permute.
pshufb is effectively a "gather" instruction over an AVX register. Equivalent to GPU permutes.
bpermute, in GPU land, is a "scatter" instruction over a vector register. There's no CPU / AVX equivalent of it. But I keep coming up with good uses of the bpermute instruction (much like pshufb is crazy flexible, its inverse, the backwards permute, is also crazy flexible).
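To make the distinction concrete, here are scalar reference semantics for the two operations (the names pull_permute/push_permute are mine, not ISA mnemonics; x86's pshufb/vpermb/vpermd implement the pull form, and the push form is the missing counterpart being asked for):

```c
#include <stddef.h>

// pull: output lane i fetches from input lane sel[i]  (in-register gather)
void pull_permute(int *out, const int *in, const int *sel, size_t lanes) {
    for (size_t i = 0; i < lanes; i++)
        out[i] = in[sel[i]];
}

// push: input lane i sends its value to output lane sel[i]  (in-register scatter);
// well-defined only when sel is a permutation (no two lanes collide)
void push_permute(int *out, const int *in, const int *sel, size_t lanes) {
    for (size_t i = 0; i < lanes; i++)
        out[sel[i]] = in[i];
}
```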
--------
Almost any code that finds itself "gathering" data across a vector register will inevitably "scatter" the data back at some point.
Much like how "pext" is the "gather" instruction for 64-bit registers, you need pdep to handle the equal-and-opposite case. It's incredibly silly that AVX / AVX512 has implemented only one half of this concept (gather / pshufb / aka permute).
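For reference, the pext/pdep pairing in code (BMI2 intrinsics); the round-trip through both equals x & mask:

```c
#include <immintrin.h>
#include <stdint.h>

uint64_t pext_pdep_roundtrip(uint64_t x, uint64_t mask) {
    uint64_t packed = _pext_u64(x, mask); // gather masked bits, right-justified
    return _pdep_u64(packed, mask);       // scatter them back under the mask
}
```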
I wish for the day that Intel/AMD implements (scatter / backwards-pshufb / aka Backwards-Permute).
-------
Fortunately, I have a Vega 64 and NVIDIA graphics cards with both permute and bpermute instructions for high-speed shuffling of data. But the CPU space should benefit from this concept too.
OK, that's cool, I didn't know about bpermute. It makes sense that there should be a counterpart. When you only have pshufb it works OK; yeah, there are tons of gaps, but if you're clever... and if you compromise on speed... thanks for telling me about bpermute!
Why do you say shuffle is “SIMD’s killer app”? I’ve only dabbled in vector instructions from a learning perspective, and seen others mention it’s important too, but have yet to understand why.
Because you can do things using bitmasks and single instructions instead of brute forcing using multiple instructions.
Let's say you have a whole heap of 8-bit numbers you want to multiply by 2, and you have a set of 256-bit registers and a nice SIMD multiply instruction. If you don't have a shuffle, you need to assemble your series of 2s for the second operand for each lane before you can even start; this is going to take dozens of instructions and clocks. With shuffle, you load lane 0 with the "2" and then splat the contents of lane 0 across the other 31 lanes in two instructions and a few clocks using the shuffle unit.
N.B. Shuffle isn't just about splatting. There's a whole heap of different operations it can do that are useful. I just picked a simple example with an obvious massive performance increase for illustrative purposes.
You're not talking about shuffle, you're talking about broadcast. A shuffle instruction is where you take one or two vectors and output a third with elements taken from any index of the inputs. So for example `out = [in[2], in[1]]` is a shuffle of a vector of length 2.
It's useful, for example, if you have RGB color data stored contiguously in memory as RGBRGBRGBRGB..., and you want to vectorize operations on R, B and G separately. You can load a few registers like [RGBR][GBRG][BRGB] and then shuffle them to [RRRR][BBBB][GGGG]. In fact it's not entirely trivial how to shuffle optimally; it takes a few shuffles to get there.
More generally, if you have an array of structs, you often need to go to a struct of arrays to do vectorized operations on the array, before returning to an array of structs again.
Another example is fast matrix transpose (in fact you can think of the RGB example as a 3 by N matrix transposed to N by 3, where N is the vector width -- AoS -> SoA is a transpose too, in a sense). Suppose you have a matrix of size N by N where N is the vector width; you need N lg N shuffles to transpose it.
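A toy version of that for N = 4, using the standard _MM_TRANSPOSE4_PS macro from xmmintrin.h, which expands to 8 shuffles (= N lg N for N = 4); the wrapper function is mine:

```c
#include <xmmintrin.h>

void transpose4x4(float m[16]) {
    __m128 r0 = _mm_loadu_ps(m + 0);
    __m128 r1 = _mm_loadu_ps(m + 4);
    __m128 r2 = _mm_loadu_ps(m + 8);
    __m128 r3 = _mm_loadu_ps(m + 12);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3); // 8 shuffle ops, no arithmetic
    _mm_storeu_ps(m + 0, r0);
    _mm_storeu_ps(m + 4, r1);
    _mm_storeu_ps(m + 8, r2);
    _mm_storeu_ps(m + 12, r3);
}
```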
I think that example is too simple to show the benefit of shuffle. It's like explaining the benefit of an adder by showing how you can move a value with X = Y + 0. Especially since there's also a (much simpler) piece of hardware dedicated to ultra-fast splat/broadcast (under the right conditions).
It's basically several moves for the price of one. Given that you operate on multiple values at once, being able to shuffle or duplicate values comes up all the time.
For example if you're filtering four image lines at a time using a 1D filter kernel, you'll want to replicate the filter coefficient to each SIMD element, so that you can multiply each of the four pixel values with the same coefficient. Shuffle lets you replicate a single coefficient value into all the elements of a register in one instruction.
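A minimal single-line variant of that pattern (the comment above describes four lines at once; names like taps/ntaps are illustrative, and FMA support is assumed): each coefficient is broadcast once and multiplied against 8 pixels at a time.

```c
#include <immintrin.h>

// Computes 8 adjacent outputs of a 1D correlation: out[j] = sum_k taps[k] * pixels[j + k]
__m256 filter8(const float *pixels, const float *taps, int ntaps) {
    __m256 acc = _mm256_setzero_ps();
    for (int k = 0; k < ntaps; k++) {
        __m256 c = _mm256_set1_ps(taps[k]);     // replicate one coefficient to all lanes
        __m256 p = _mm256_loadu_ps(pixels + k); // 8 neighbouring pixels
        acc = _mm256_fmadd_ps(p, c, acc);
    }
    return acc;
}
```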
This Rust issue [0] was the best short summary I could find of what a SIMD shuffle is:
„A "shuffle", in SIMD terms, takes a SIMD vector (or possibly two vectors) and a pattern of source lane indexes (usually as an immediate), and then produces a new SIMD vector where the output is the source lane values in the pattern given.“
Cliffnotes:
* Zen4 AVX512 is mostly double-pumped: native 256-bit hardware processes the two halves of a 512-bit register.
* No throttling observed
* 512-bit shuffle pipeline (!!). A powerful exception to the "double-pumping" found in most other AVX512 instructions.
* AMD seemingly handles the AVX512 mask registers better than Intel.
* Gather/scatter are slow in AMD's Zen4 implementation.
* Intel's native 512-bit load/store units have clear advantages over AMD's 256-bit load/store units when reading/writing to L1 cache and beyond.