I think it is important to note that, even though it is double-pumped, using 512-bit registers puts less pressure on the decoder and lets the pipelines fill. So use 512-bit if you can.
It should also be noted that the idea that Zen 4 is "double-pumped" while the Intel CPUs are not is quite misleading.
On most Intel CPUs with AVX-512 support there are two classes of 512-bit instructions: those executed by fusing a pair of 256-bit execution units, so that 512-bit and 256-bit instructions have equal throughput, and those executed by fusing a pair of 256-bit units while a third 256-bit unit is also extended to 512 bits.
For the second class of instructions the Intel CPUs have a throughput of two 512-bit instructions per cycle vs. three 256-bit instructions per cycle.
Compared to the cheaper Intel models, Zen 4, while keeping the same total throughput as Zen 3 (two 512-bit instructions per cycle instead of Zen 3's four 256-bit instructions per cycle), either matches or exceeds the throughput of the Intel CPUs with AVX-512. Zen 4 also allows 1 FMA + 1 FADD per cycle, while these Intel CPUs can execute only 1 FMA per cycle.
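To make the 1 FMA + 1 FADD point concrete, here is a minimal sketch (my own, not from the article) of a loop shape that can feed both pipes at once; a real kernel would unroll with several accumulators to hide FMA latency:

```c
#include <immintrin.h>
#include <stddef.h>

// Two independent accumulator chains: one fed by FMAs, one by plain adds,
// so a core that can issue 1 FMA + 1 FADD per cycle can keep both busy.
void fma_plus_add(const double *a, const double *b, const double *c,
                  size_t n, double fma_out[8], double add_out[8]) {
    __m512d acc_fma = _mm512_setzero_pd();
    __m512d acc_add = _mm512_setzero_pd();
    for (size_t i = 0; i + 8 <= n; i += 8) {
        __m512d va = _mm512_loadu_pd(a + i);
        __m512d vb = _mm512_loadu_pd(b + i);
        __m512d vc = _mm512_loadu_pd(c + i);
        acc_fma = _mm512_fmadd_pd(va, vb, acc_fma); // FMA pipe
        acc_add = _mm512_add_pd(acc_add, vc);       // FP add pipe
    }
    _mm512_storeu_pd(fma_out, acc_fma);
    _mm512_storeu_pd(add_out, acc_add);
}
```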
The only important advantage of Intel appears in the most expensive models of the server and workstation CPUs, i.e. in most Xeon Gold, all Xeon Platinum and all of the Xeon W models that have AVX-512 support.
In these more expensive models there is a second 512-bit FMA unit, which doubles the FMA throughput compared to Zen 4. These models are also helped by a doubled throughput for loads from the L1 cache, which is matched to the FMA throughput.
So the AVX-512 implementation in Zen 4 is superior to that in the cheaper CPUs like Tiger Lake, even without taking into account the few new execution units added in Zen 4, like the 512-bit shuffle unit.
Only the Xeon Platinum and similar models of the upcoming Sapphire Rapids will have a definitely greater throughput for floating-point operations than Zen 4, but they will also have a significantly lower all-core clock frequency (due to the inferior manufacturing process), so the higher throughput per clock cycle is not certain to overcome the deficit in clock frequency.
Yes, Intel also takes a less than "full" approach to moving from 256-bit to 512-bit.
Though I think it is fair to say the Intel implementation represents an intermediate state between the AMD approach (essentially no increase in execution or datapath resources outside of the shuffle) and simply extending everything 2x, i.e. a full doubling of every resource.
Essentially, an SKX chip behaves as if it had two full-width 512-bit execution ports: p01 (via fusion) and p5. For 256-bit it has three ports. Not all ports can do everything, so the comparison is sometimes 3 vs 2 or 2 vs 1, but also sometimes 2 vs 2 (FMA operations on 2-FMA chips come to mind).
Critically, however, the load and store units were also extended to 512 bits: SKX can do 2x loads (1024 bits) and 1x store (512 bits) per cycle. Load/store throughput puts a hard cap on the performance of load- and store-heavy AVX methods, which does include some fairly trivial but important integer loops like memcpy, memset and memchr type stuff that is fast enough to hit the load or store limits.
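For illustration, a bare-bones copy loop of the kind being described (names are mine; a real memcpy does alignment, non-temporal stores, tails, etc.). The port math is exactly one 512-bit load plus one 512-bit store per 64 bytes, so the single store port is the ceiling:

```c
#include <immintrin.h>
#include <stddef.h>

void copy_zmm(void *dst, const void *src, size_t bytes) {
    char *d = (char *)dst;
    const char *s = (const char *)src;
    size_t i = 0;
    for (; i + 64 <= bytes; i += 64) {
        __m512i v = _mm512_loadu_si512((const void *)(s + i)); // 1 load port
        _mm512_storeu_si512((void *)(d + i), v);               // 1 store port
    }
    for (; i < bytes; i++)  // scalar tail
        d[i] = s[i];
}
```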
Maxing out memory bandwidth requires multiple threads because Intel cores have relatively few line fill buffers. I've seen around 12 GB/s per SKX core with AVX-512.
You usually don't even need AVX-512 to sustain enough loads/stores at the core to max out memory bandwidth "in theory": even with 256-bit loads and assuming 2 loads / 1 store per cycle (ICL/Zen 3 and newer can do more), that's 256 GB/s of read bandwidth or 128 GB/s of write bandwidth (or both, at once!) at 4 GHz.
Indeed, you can reach these numbers if you always hit in L1 and come close if you always hit in L2. The load number especially is higher than almost any single socket bandwidth until perhaps very recently*: an 8-channel chip with the fastest DDR4-3200 would get 25.6 x 8 = 204.8 GB/s max theoretical bandwidth. Most chips have fewer channels and lower max theoretical bandwidth.
However, and as a sibling comment alludes to, you generally cannot in practice sustain enough outstanding misses from a single core to actually achieve this number. E.g., with 16 outstanding misses and 100 ns latency per cache line you can only demand-fetch at ~10 GB/s from one core. Actual numbers are higher due to prefetching, which both decreases the latency (since the prefetch is initiated from a component closer to the memory controller) and makes more outstanding misses available (since there are more miss buffers at the L2 than at the core), but this only roughly doubles the bandwidth: it's hard to get more than 20-30 GB/s from a single core on Intel.
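For concreteness, the Little's-law arithmetic behind that ~10 GB/s figure, using the 16 outstanding misses, 64-byte cache lines and ~100 ns latency assumed above:

$$
\text{BW} \approx \frac{\text{outstanding misses} \times \text{line size}}{\text{latency}} = \frac{16 \times 64\ \text{B}}{100\ \text{ns}} \approx 10.2\ \text{GB/s}
$$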
This isn't a fundamental limitation which applies to every CPU however: Apple chips can extract the entire bandwidth from a single core, despite having much smaller 128-bit (perhaps 256-bit if you consider load pair) load and store instructions.
---
* Not really sure about this one: are there 16-channel DDR5 setups out there yet? (16 DDR5 channels correspond to 8 independent DIMMs, so that is similar to an 8-channel DDR4 setup, as DDR5 has 2x channels per DIMM.)
Indeed a great article, well worth reading in full for anyone who uses AVX-512.
Two other things that jumped out at me: VPCONFLICT is 10x as fast, compressstoreu is >10x slower. Those might be enough to warrant a Zen4-specific codepath in Highway.
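As a hedged sketch of what a Zen 4-specific path might look like (this is not Highway's actual API; the helper name is made up): compress into a register and do a plain store instead of vcompressstoreu:

```c
#include <immintrin.h>

// Write the lanes of v selected by mask k to dst, return the count written.
// Note: the store always writes a full 64 bytes, so dst needs 16 ints of slack.
static inline int compress_store_sketch(int *dst, __mmask16 k, __m512i v) {
    __m512i packed = _mm512_maskz_compress_epi32(k, v); // compress in-register
    _mm512_storeu_si512(dst, packed);                   // plain 512-bit store
    return _mm_popcnt_u32((unsigned)k);
}
```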
Looks like SIMD implementations that use LUTs should favor small tables that fit in registers, using `vpermi2pd` for lookups, over larger tables + gather.
With 64-bit elements you still get a LUT size of 16 (shuffle indexes into two 8x double vectors), which can be good enough for functions like log and exp.
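For instance, a 16-entry double-precision LUT held in two zmm registers can be indexed with a single vpermi2pd; a minimal sketch (function and variable names are mine):

```c
#include <immintrin.h>

// For each lane j: result[j] = (idx[j] < 8) ? lut_lo[idx[j]] : lut_hi[idx[j] - 8]
static inline __m512d lut16_lookup(__m512i idx, __m512d lut_lo, __m512d lut_hi) {
    return _mm512_permutex2var_pd(lut_lo, idx, lut_hi); // vpermi2pd
}
```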
Loading data from random memory locations has become too expensive compared to computation. For log, exp, trigonometry, and similar, I think people rarely use lookup tables anymore. Instead, they use fairly high-degree polynomial approximations, and for log/exp they abuse the IEEE binary float representation.
LUTs at least do well in microbenchmarks, but I do worry that they may do comparatively much worse in real code.
That said, that's another advantage of small tables using vpermi2pd.
The Julia/base implementations of log and exp both use LUTs.
The SIMD AVX512 implementation of exp used by LoopVectorization.jl will sometimes use the 16 element table.
I experimented with log, but had some difficulty getting accuracy and performance, so the version LoopVectorization.jl currently uses doesn't use a table.
Shuffle is SIMD's killer app. It's apparently an interesting but expensive circuit, and it's smart to prioritize it. Absolute best instruction, hands down. So yes, double-pumping isn't full speed (meaning single-cycle), but it increases compatibility with AVX-512 code. I guess a program that picks its code path at runtime from CPUID might not benefit, and of course there are all kinds of... but for pedestrian purposes, meaning everything on GitHub, it's a step forward. Hey, 40% speedup on Cinebench, that's good.
A shame that AVX512 only has pshufb (aka permute), and is missing the GPU instruction "bpermute", aka backwards permute.
pshufb is effectively a "gather" instruction over an AVX register. Equivalent to GPU permutes.
bpermute, in GPU land, is a "scatter" instruction over a vector register. There's no CPU / AVX equivalent of it. But I keep coming up with good uses of the bpermute instruction (much like pshufb is crazy flexible, its inverse, the backwards permute, is also crazy flexible).
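To make the distinction concrete, here are scalar reference semantics for the two operations (the names pull_permute/push_permute are mine, not ISA mnemonics; x86's pshufb/vpermb/vpermd implement the pull form, and the push form is the missing counterpart being asked for):

```c
#include <stddef.h>

// pull: output lane i fetches from input lane sel[i]  (in-register gather)
void pull_permute(int *out, const int *in, const int *sel, size_t lanes) {
    for (size_t i = 0; i < lanes; i++)
        out[i] = in[sel[i]];
}

// push: input lane i sends its value to output lane sel[i]  (in-register scatter);
// well-defined only when sel is a permutation (no two lanes collide)
void push_permute(int *out, const int *in, const int *sel, size_t lanes) {
    for (size_t i = 0; i < lanes; i++)
        out[sel[i]] = in[i];
}
```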
--------
Almost any code that finds itself "gathering" data across a vector register will inevitably "scatter" the data back at some point.
Much like how "pext" is the "gather" instruction for 64-bit registers, you need pdep to handle the equal-and-opposite case. It's incredibly silly that AVX / AVX512 has implemented only one half of this concept (gather / pshufb / aka permute).
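For reference, the pext/pdep pairing in code (BMI2 intrinsics); the round-trip through both equals x & mask:

```c
#include <immintrin.h>
#include <stdint.h>

uint64_t pext_pdep_roundtrip(uint64_t x, uint64_t mask) {
    uint64_t packed = _pext_u64(x, mask); // gather masked bits, right-justified
    return _pdep_u64(packed, mask);       // scatter them back under the mask
}
```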
I wish for the day that Intel/AMD implements (scatter / backwards-pshufb / aka Backwards-Permute).
-------
Fortunately, I have a Vega 64 and NVIDIA graphics cards with both permute and bpermute instructions for high-speed shuffling of data. But the CPU space should benefit from this concept too.
OK, that's cool, I didn't know about bpermute. It makes sense that there should be a counterpart. When you only have pshufb it works OK; yeah, there are tons of gaps, but if you're clever... and if you compromise on speed... thanks for telling me about bpermute!
Why do you say shuffle is “SIMD’s killer app”? I’ve only dabbled in vector instructions from a learning perspective, and seen others mention it’s important too, but have yet to understand why.
Because you can do things using bitmasks and single instructions instead of brute forcing using multiple instructions.
Let's say you have a whole heap of 8-bit numbers you want to multiply by 2, and you have a set of 256-bit registers and a nice SIMD multiply instruction. If you don't have a shuffle, you need to assemble your series of 2s for the second operand for each lane before you can even start; this is going to take dozens of instructions and clocks. With shuffle, you load lane 0 with the "2" and then splat the contents of lane 0 across the other 31 lanes in two instructions and a few clocks using the shuffle unit.
N.B. Shuffle isn't just about splatting. There's a whole heap of different operations it can do that are useful. I just picked a simple example with an obvious massive performance increase for illustrative purposes.
You're not talking about shuffle, you're talking about broadcast. A shuffle instruction is where you take one or two vectors and output a third with elements taken from any index of the inputs. So for example `out = [in[2], in[1]]` is a shuffle of a vector of length 2.
It's useful, for example, if you have RGB color data stored contiguously in memory as RGBRGBRGBRGB..., and you want to vectorize operations on R, B and G separately. You can load a few registers like [RGBR][GBRG][BRGB] and then shuffle them to [RRRR][BBBB][GGGG]. In fact it's not entirely trivial how to shuffle optimally; it takes a few shuffles to get there.
More generally, if you have an array of structs, you often need to go to a struct of arrays to do vectorized operations on the array, before returning to an array of structs again.
Another example is fast matrix transpose (in fact you can think of the RGB example as a 3 by N matrix transposed to N by 3, where N is the vector width -- AoS -> SoA is a transpose too, in a sense). Suppose you have a matrix of size N by N where N is the vector width; you need N lg N shuffles to transpose it.
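A toy version of that for N = 4, using the standard _MM_TRANSPOSE4_PS macro from xmmintrin.h, which expands to 8 shuffles (= N lg N for N = 4); the wrapper function is mine:

```c
#include <xmmintrin.h>

void transpose4x4(float m[16]) {
    __m128 r0 = _mm_loadu_ps(m + 0);
    __m128 r1 = _mm_loadu_ps(m + 4);
    __m128 r2 = _mm_loadu_ps(m + 8);
    __m128 r3 = _mm_loadu_ps(m + 12);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3); // 8 shuffle ops, no arithmetic
    _mm_storeu_ps(m + 0, r0);
    _mm_storeu_ps(m + 4, r1);
    _mm_storeu_ps(m + 8, r2);
    _mm_storeu_ps(m + 12, r3);
}
```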
I think that example is too simple to show the benefit of shuffle. It's like explaining the benefit of an adder by showing how you can move a value with X = Y + 0. Especially since there's also a (much simpler) piece of hardware dedicated to ultra-fast splat/broadcast (under the right conditions).
It's basically several moves for the price of one. Given that you operate on multiple values at once, being able to shuffle or duplicate values comes up all the time.
For example if you're filtering four image lines at a time using a 1D filter kernel, you'll want to replicate the filter coefficient to each SIMD element, so that you can multiply each of the four pixel values with the same coefficient. Shuffle lets you replicate a single coefficient value into all the elements of a register in one instruction.
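A minimal single-line variant of that pattern (the comment above describes four lines at once; names like taps/ntaps are illustrative, and FMA support is assumed): each coefficient is broadcast once and multiplied against 8 pixels at a time.

```c
#include <immintrin.h>

// Computes 8 adjacent outputs of a 1D correlation: out[j] = sum_k taps[k] * pixels[j + k]
__m256 filter8(const float *pixels, const float *taps, int ntaps) {
    __m256 acc = _mm256_setzero_ps();
    for (int k = 0; k < ntaps; k++) {
        __m256 c = _mm256_set1_ps(taps[k]);     // replicate one coefficient to all lanes
        __m256 p = _mm256_loadu_ps(pixels + k); // 8 neighbouring pixels
        acc = _mm256_fmadd_ps(p, c, acc);
    }
    return acc;
}
```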
This Rust issue [0] was the best short summary I could find of what a SIMD shuffle is:
„A "shuffle", in SIMD terms, takes a SIMD vector (or possibly two vectors) and a pattern of source lane indexes (usually as an immediate), and then produces a new SIMD vector where the output is the source lane values in the pattern given.“
Cliffnotes:
* Zen4 AVX512 is mostly double-pumped: native 256-bit hardware processes the two halves of a 512-bit register.
* No throttling observed
* 512-bit shuffle pipeline (!!). A powerful exception to the "double-pumping" found in most other AVX512 instructions.
* AMD seemingly handles the AVX512 mask registers better than Intel.
* Gather/scatter are slow in AMD's Zen4 implementation.
* Intel's native 512-bit load/store units have clear advantages over AMD's 256-bit load/store units when reading/writing to L1 cache and beyond.