Zen4's AVX512 Teardown (mersenneforum.org)
415 points by dragontamer on Sept 26, 2022 | 97 comments



Excellent Teardown by "Mysticial" from mersenneforum.org.

Cliffnotes:

* Zen4 AVX512 is mostly double-pumped: the native 256-bit hardware processes the two halves of a 512-bit register.

* No throttling observed

* 512-bit shuffle pipeline (!!). A powerful exception to the "double-pumping" found in most other AVX512 instructions.

* AMD seemingly handles the AVX512 mask registers better than Intel.

* Gather/Scatter slow on AMD's Zen4 implementation.

* Intel's 512-bit native load/store unit has clear advantages over AMD's 256-bit load-store unit when reading/writing to L1 cache and beyond.


I think it is important to note that, even though execution is double-pumped, using 512-bit registers puts less pressure on decode and lets the pipelines fill. So use 512-bit if you can.
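
To illustrate (a minimal sketch of my own, assuming AVX-512F is available): the same 16-float add takes two instructions through the front end at 256 bits but only one at 512 bits, even if execution is split into two halves.

    #include <immintrin.h>

    // 256-bit path: two adds (plus twice the loads/stores) go through decode
    void add16_256(float *dst, const float *a, const float *b) {
        _mm256_storeu_ps(dst,     _mm256_add_ps(_mm256_loadu_ps(a),     _mm256_loadu_ps(b)));
        _mm256_storeu_ps(dst + 8, _mm256_add_ps(_mm256_loadu_ps(a + 8), _mm256_loadu_ps(b + 8)));
    }

    // 512-bit path: one instruction decodes, even if it executes as two 256-bit halves
    void add16_512(float *dst, const float *a, const float *b) {
        _mm512_storeu_ps(dst, _mm512_add_ps(_mm512_loadu_ps(a), _mm512_loadu_ps(b)));
    }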


It should also be noted that the belief that Zen 4 is "double-pumped" while the Intel CPUs are not is misleading.

On most Intel CPUs with AVX-512 support there are two classes of 512-bit instructions: the first is executed by combining a pair of 256-bit units, and thus has equal throughput for 512-bit and 256-bit instructions; the second is executed both by combining a pair of 256-bit execution units and by extending another 256-bit execution unit to 512 bits.

For the second class of instructions the Intel CPUs have a throughput of two 512-bit instructions per cycle vs. three 256-bit instructions per cycle.

Compared to the cheaper models of Intel CPUs, Zen 4, while having the same total throughput as Zen 3 (two 512-bit instructions per cycle in Zen 4 vs. four 256-bit instructions per cycle in Zen 3), either matches or exceeds the throughput of the Intel CPUs with AVX-512. Compared to the Intel CPUs, Zen 4 allows 1 FMA + 1 FADD per cycle, while on the Intel CPUs only 1 FMA per cycle can be executed.

The only important advantage of Intel appears in the most expensive models of the server and workstation CPUs, i.e. in most Xeon Gold, all Xeon Platinum and all of the Xeon W models that have AVX-512 support.

In these more expensive models, there is a second 512-bit FMA unit, which enables a double FMA throughput compared to Zen 4. These models with double FMA throughput are also helped by a double throughput for the loads from the L1 cache, which is matched to the FMA throughput.

So the AVX-512 implementation in Zen 4 is superior to that in the cheaper CPUs like Tiger Lake, even without taking into account the few new execution units added in Zen 4, like the 512-bit shuffle unit.

Only the Xeon Platinum and similar models of the future Sapphire Rapids will have a definitely greater throughput for floating-point operations than Zen 4, but they will also have a significantly lower all-core clock frequency (due to the inferior manufacturing process), so the higher throughput per clock cycle is not certain to overcome the deficit in clock frequency.


Yes, Intel also takes a less than "full" approach to moving from 256b to 512.

Though I think it is fair to say the Intel implementation represents kind of an intermediate state between the AMD approach (essentially no increase in execution or datapath resources outside of the shuffle) and simply extending everything 2x and a full doubling of every resource.

Essentially, on SKX the chip behaves as if it had 2 full-width 512-bit execution ports: p01 (via fusion) and p5. For 256b there are three ports. Not all ports can do everything, so the comparison is sometimes 3 vs 2 or 2 vs 1, but also sometimes 2 vs 2 (FMA operations on 2-FMA chips come to mind).

Critically, however, the load and store units were also extended to 512 bits: SKX can do 2x loads (1024 bits) and 1x store (512 bits) per cycle. This puts a hard cap on the performance of load- and store-heavy AVX methods, which does include some fairly trivial but important integer loops like memcpy, memset and memchr type stuff, which are fast enough to hit the load or store limits.
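
For illustration, a trivial 64-bytes-per-iteration copy loop of the kind that runs into those load/store limits (a hedged sketch of my own; assumes AVX-512F and that n is a multiple of 64):

    #include <immintrin.h>
    #include <stddef.h>

    void copy_avx512(void *dst, const void *src, size_t n) {
        // one 64-byte load and one 64-byte store per iteration
        for (size_t i = 0; i < n; i += 64) {
            __m512i v = _mm512_loadu_si512((const char *)src + i);
            _mm512_storeu_si512((char *)dst + i, v);
        }
    }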


Neat! So does that mean a single thread can hit the membw limit using AVX512 for memcpy operations?


Maxing out memBW requires multiple threads because Intel cores are relatively limited in line fill buffers. I've seen around 12 GB/s per SKX core with AVX-512.


You usually don't even need AVX512 to sustain enough load/stores at the core to max out memory bandwidth "in theory": even with 256 bit loads and assuming 2/1 loads/stores per cycle (ICL/Zen 3 and newer can do more), that's 256 GB/s of read bandwidth or 128 GB/s write bandwidth (or both, at once!) at 4 GHz.

Indeed, you can reach these numbers if you always hit in L1 and come close if you always hit in L2. The load number especially is higher than almost any single socket bandwidth until perhaps very recently*: an 8-channel chip with the fastest DDR4-3200 would get 25.6 x 8 = 204.8 GB/s max theoretical bandwidth. Most chips have fewer channels and lower max theoretical bandwidth.

However, and as a sibling comment alludes to, you generally cannot in practice sustain enough outstanding misses from a single core to actually achieve this number. E.g., with 16 outstanding misses and 100 ns latency per cache line you can only demand-fetch at ~10 GB/s from one core. Actual numbers are higher due to prefetching, which both decreases the latency (since the prefetch is initiated from a component closer to the memory controller) and makes more outstanding misses available (since there are more miss buffers from L2 than there are from the core), but this only roughly doubles the bandwidth: it's hard to get more than 20-30 GB/s from a single core on Intel.

This isn't a fundamental limitation which applies to every CPU however: Apple chips can extract the entire bandwidth from a single core, despite having much smaller 128-bit (perhaps 256-bit if you consider load pair) load and store instructions.

---

* Not really sure about this one: are there 16-channel DDR5 setups out there yet (16 DDR5 channels correspond to 8 independent DIMMs, so this is similar to an 8-channel DDR4 setup, as DDR5 has 2x channels per DIMM)?


Yeah, the claim was that this is why it hit higher clock speeds. The front end will be hard pressed to hit/maintain 4 IPC, while 2 IPC is much easier.


Indeed a great article, well worth reading in full for anyone who uses AVX-512.

Two other things that jumped out at me: VPCONFLICT is 10x as fast, compressstoreu is >10x slower. Those might be enough to warrant a Zen4-specific codepath in Highway.


The Intel optimization manual has a fun example where they use vpconflict for vectorizing sparse dot products: https://github.com/intel/optimization-manual/blob/main/chap1...

I benchmarked it on Intel, and it was indeed quite fast/a good improvement over the scalar version. Will be interesting to try that on AMD.


Nice! Thanks for linking it :)


Looks like SIMD implementations that use LUTs should favor small tables that fit in registers, using `vpermi2pd` for lookups, over larger tables + gather.

With 64-bit elements, you still get a LUT size of 16 (shuffle indexes into two 8xdouble vectors), which can be good enough for functions like log and exp.
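
Roughly like this (my own sketch, not from the thread): a 16-entry double-precision table split across two registers and indexed with vpermi2pd.

    #include <immintrin.h>

    // lut_lo/lut_hi hold table entries 0-7 and 8-15; each 64-bit index in idx
    // selects one of the 16 entries (bit 3 picks the register, bits 0-2 the lane).
    __m512d lut16_lookup(__m512d lut_lo, __m512d lut_hi, __m512i idx) {
        return _mm512_permutex2var_pd(lut_lo, idx, lut_hi);  // vpermi2pd / vpermt2pd
    }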


Loading data from random memory locations became too expensive compared to computation. For log, exp, trigonometry, and similar, I think people rarely use any lookup tables. Instead, they use high-degree polynomial approximations, and for log/exp they abuse the IEEE binary float representation.

Here's a log() function from the standard library in OpenBSD: https://github.com/openbsd/src/blob/master/lib/libm/src/e_lo...


LUTs at least do well in microbenchmarks, but I do worry that they may do comparatively much worse in real code. That said, that's another advantage of small tables using vpermi2pd.

The Julia/base implementations of log and exp both use LUTs. The SIMD AVX512 implementation of exp used by LoopVectorization.jl will sometimes use the 16 element table. I experimented with log, but had some difficulty getting accuracy and performance, so the version LoopVectorization.jl currently uses doesn't use a table.


BTW, since you're apparently working on stuff like that, check out this repository:

https://github.com/Const-me/AvxMath/blob/master/AvxMath/AvxM...

The license is MIT, copy-paste friendly. It doesn’t use AVX512 though, only AVX1 and optionally 2.


Shuffle is SIMD's killer app. It's apparently an interesting but expensive circuit, and it's smart to prioritize it. Absolute best instruction, hands down. So double-pumping isn't full speed (meaning single cycle), but it increases compatibility with AVX512 code. I guess a program that changes its execution at runtime based on CPUID might not benefit, and of course there's all kinds of...but for pedestrian purposes, meaning everything on GitHub, it's a step. Hey, 40% speedup on Cinebench, that's good.


> Shuffle is the SIMD's killer app

A shame that AVX512 only has pshufb (aka: permute), and is missing the GPU-instruction "bpermute", aka backwards permute.

pshufb is effectively a "gather" instruction over an AVX register. Equivalent to GPU permutes.

bpermute, in GPU land, is a "scatter" instruction over a vector register. There's no CPU / AVX equivalent of it. But I keep coming up with good uses of the bpermute instruction (much like pshufb is crazy flexible, its inverse, the backwards permute, is also crazy flexible).
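
To make the distinction concrete, a small sketch of my own (using AVX-512 dword permutes rather than byte-wise pshufb): the forward permute is a gather over a register; the backwards permute would be the corresponding scatter, which x86 lacks.

    #include <immintrin.h>

    // forward permute ("gather over a register"): out[i] = src[idx[i]]
    __m512i reg_gather(__m512i idx, __m512i src) {
        return _mm512_permutexvar_epi32(idx, src);   // vpermd
    }
    // the inverse, out[idx[i]] = src[i] (GPU bpermute), has no single x86
    // instruction; you would have to invert the index vector first or go
    // through memory.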

--------

Almost any code that finds itself "gathering" data across a vector register will inevitably "scatter" the data back at some point.

Much like how "pext" is the "gather" instruction for 64 bits, you need pdep to handle the equal-and-opposite case. It's incredibly silly that AVX / AVX512 has implemented only one half of this concept (gather / pshufb / aka permute).
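
The 64-bit analogy in code (a sketch, assuming BMI2 is available):

    #include <immintrin.h>
    #include <stdint.h>

    // pext "gathers" the bits selected by mask down to the low end;
    // pdep "scatters" low bits back out to the positions selected by mask.
    uint64_t gather_bits(uint64_t x, uint64_t mask)  { return _pext_u64(x, mask); }
    uint64_t scatter_bits(uint64_t x, uint64_t mask) { return _pdep_u64(x, mask); }
    // scatter_bits(gather_bits(x, m), m) round-trips the selected bits.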

I wish for the day that Intel/AMD implements (scatter / backwards-pshufb / aka Backwards-Permute).

-------

Fortunately, I got Vega64 and NVidia Graphics Cards with both permute and bpermute instructions for high-speed shuffling of data. But CPU-space should benefit from this concept too.


OK, that's cool, I didn't know about bpermute. It makes sense that there should be a counterpart. When you only have pshufb it works OK, yeah there's tons of gaps, but if you're clever and...and if you compromise on speed...thanks for telling me about bpermute!


Why do you say shuffle is “SIMD’s killer app”? I’ve only dabbled in vector instructions from a learning perspective, and seen others mention it’s important too, but have yet to understand why.


Because you can do things using bitmasks and single instructions instead of brute forcing using multiple instructions.

Let's say you have a whole heap of 8-bit numbers you want to multiply by 2 and you have a set of 256-bit registers and a nice SIMD multiply command. If you don't have a shuffle you need to assemble your series of 2s for the second operand for each lane before you can even start. This is going to take hundreds of instructions and hundreds of clocks. Shuffle means you load up lane 0 with the "2" and then splat the contents of lane 0 across the other 31 lanes in two instructions and a few clocks using the shuffle unit.

N.B. Shuffle isn't just about splatting. There's a whole heap of different operations it can do that are useful. I just picked a simple example with an obvious massive performance increase for illustrative purposes.


You're not talking about shuffle, you're talking about broadcast. A shuffle instruction is where you take one or two vectors and output a third with elements taken from any index of the input. So for example `out = [in[2], in[1]]` is a shuffle of a vector of length 2.

It's useful for example if you have say RGB color data stored contiguously in memory as say RGBRGBRGBRGB..., and you want to vectorize operations on R, B and G separately. You can load a few registers like [RGBR][GBRG][BRGB], and then shuffle them to [RRRR][BBBB][GGGG]. In fact it's not entirely trivial how to shuffle optimally, it takes a few shuffles to get there.

More generally, if you have an array of structs, you often need to go to struct of arrays to do vectorized operations on the array, before returning to an array of struct again.

Another example is fast matrix transpose (in fact you can think of the RGB example as a 3 by N matrix transpose to N by 3, where N is the vector width -- AoS -> SoA is a transpose too, in a sense). Suppose you have a matrix of size N by N where N is the vector width; you need N lg N shuffles to transpose the matrix.
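
For the 4x4 single-precision case the SSE headers even ship a ready-made shuffle sequence; a minimal sketch of my own using it:

    #include <xmmintrin.h>

    // transpose a 4x4 float matrix held in four XMM registers;
    // _MM_TRANSPOSE4_PS expands to a handful of unpack/shuffle instructions.
    void transpose4x4(__m128 *r0, __m128 *r1, __m128 *r2, __m128 *r3) {
        _MM_TRANSPOSE4_PS(*r0, *r1, *r2, *r3);
    }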


I think that example is too simple to show the benefit of shuffle. It's like explaining the benefit of an adder by showing how you can move a value with X = Y + 0. Especially since there's also a (much simpler) piece of hardware dedicated to ultra-fast splat/broadcast (under the right conditions).


It's basically several moves for the price of one. Given that you operate on multiple values at once, being able to shuffle or duplicate values comes up all the time.

For example if you're filtering four image lines at a time using a 1D filter kernel, you'll want to replicate the filter coefficient to each SIMD element, so that you can multiply each of the four pixel values with the same coefficient. Shuffle lets you replicate a single coefficient value into all the elements of a register in one instruction.
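
In code that looks something like this (a sketch of my own with made-up names, filtering 8 adjacent pixels of one row rather than four lines; assumes AVX2/FMA):

    #include <immintrin.h>

    // filter 8 adjacent pixels of one row with a length-`taps` 1D kernel
    __m256 filter8(const float *row, const float *kernel, int taps) {
        __m256 acc = _mm256_setzero_ps();
        for (int k = 0; k < taps; ++k) {
            __m256 coeff = _mm256_set1_ps(kernel[k]);   // replicate one coefficient to all lanes
            acc = _mm256_fmadd_ps(coeff, _mm256_loadu_ps(row + k), acc);
        }
        return acc;
    }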


Which is the point of SIMD. Several moves for the price of one.


This Rust issue [0] was the best short summary I could find of what a SIMD shuffle is:

„A "shuffle", in SIMD terms, takes a SIMD vector (or possibly two vectors) and a pattern of source lane indexes (usually as an immediate), and then produces a new SIMD vector where the output is the source lane values in the pattern given.“

[0] https://github.com/rust-lang/portable-simd/issues/11


I use PSHUFB to convert 24-bit RGB to 32-bit RGBX or BGRX. Without a shuffle instruction it'd be quite a bit harder.
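
Roughly this idea (a sketch of my own, zeroing the X byte rather than setting it):

    #include <tmmintrin.h>

    // expand 4 packed RGB pixels (12 bytes) into 4 RGBX pixels (16 bytes)
    __m128i rgb_to_rgbx(__m128i rgb12) {
        const __m128i shuf = _mm_setr_epi8(0, 1, 2, -1,  3, 4, 5, -1,
                                           6, 7, 8, -1,  9, 10, 11, -1);
        return _mm_shuffle_epi8(rgb12, shuf);  // a mask byte with the high bit set writes zero
    }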


Here is an overview of the usage of the shuffle instruction to speed up decompression in ClickHouse: https://habr.com/ru/company/yandex/blog/457612/


I find the vpmullq part the most stunning.

This instruction is used in some bignum code, for example if you are implementing RSA. Yet AMD implemented it three times faster than Intel.

I'm also fascinated by AMD now making AVX512 worthwhile on consumer devices (whereas until quite recently it would artificially slow down the Intel CPUs that had it), which presumably will lead to widespread adoption where it matters. Intel's strategy of turning off AVX512 in its recent consumer devices because their efficiency cores don't have it may turn out to be a monumental mistake.


No one is going to be able to seriously use and support AVX512 (or be sufficiently motivated to implement support for it in their libraries and especially applications) until Intel finally gets its act together with regards to AVX512 and decides it actually wants to commit to it being a thing.

The AVX2 rollout was (comparatively) flawless. The gains AVX512 brings over AVX2 are, for most people (specialty libs excluded), not worth dealing with the terrible CPU support. And Intel just keeps making the situation worse, taking one step forward and two back.


Imagine next-gen consoles, and suppose they stick with AMD. Then every game studio and game engine studio is going to love flinging some AVX-512 around. Developers will get more experience with it, and any game that runs on PC and console is going to look slow on a PC with Intel CPUs that have bad support. More libraries and tools will get created that people will want to use.

Adoption could accelerate quickly!


Next-next gen consoles are probably still a good 5+ years away. AVX-512 for consumers will either have already become "a thing" or it'll be dead & buried by then.


People said that about it 5 years ago too. Yet here we are. Nobody is going to just get rid of it; servers are already using it.


The biggest problem is not support for the instruction set in the silicon, but the performance penalty it brings.

SIMD hardware is the most power-hungry block on Intel CPUs, and the frequency penalty it brings is never completely disclosed in the tech docs. Sometimes Intel doesn't share that information even with you as a serious customer.

In HPC world, no instruction is too obscure or niche to use. However, when you use these instructions too frequently, the heat load it generates can slow you down instead of accelerating you over the course of your job, so AVX512 is a pretty mixed case in Intel CPUs.

Regardless of this penalty, numeric code benefits from wider SIMD pipelines in most cases. At worst, you see no speedup, but you're investing for the future.

On the other hand, we have seen applications which run faster on previous generation hardware due to over-optimization.


> However, when you use these instructions too frequently, the heat load it generates can slow you down

It's not the heat load that slows you down. If you are using them enough that you produce enough heat that you have to downclock, it's still a win because the instructions improved your throughput more than what you lost in clocks.

The problem with Intel's initial AVX-512 implementation was that they didn't clock down because of heat; they clocked down pre-emptively and substantially whenever the CPU executed even a single AVX-512 instruction, even if there was no added heat load, and stayed on the lower clocks for a long period. This worked fine for proper SIMD loads, but was crushing in any situation where there was just a handful of AVX-512 ops between long stretches, such as using an AVX-512-optimized version of some library function.


> [T]hey clocked down pre-emptively and substantially whenever the CPU executed even a single AVX-512 instruction...

Because you were hitting the power envelope limits of the CPU in these cases too. You might not see the heat, but the CPU cannot carry the power required to keep that core at non-AVX speeds while these power-hungry blocks operate at full speed.

As I said, to add insult to injury, Intel didn't share the exact details of its AVX implementations and the frequency ranges at which they operate, either.

Ah, publicly sharing your findings is/was forbidden too.


No, as the above poster said, Intel slows down the CPU before any actual increase in power consumption or temperature occurs, because of the fear that its power limit and temperature controller will not be able to react fast enough when the power increase eventually happens.

Whatever control mechanism is used in the AMD Zen CPUs is better than Intel's, so they downclock only when the power consumption really increases, and the clock frequency recovers when the power consumption decreases, so there is no penalty when sporadically using some 512-bit instructions, as there is on the Intel CPUs.


No, actually grandparent is correct here - Skylake-X/Skylake-SP have never clocked down the first time they see an AVX-512 instruction. It actually is when the AVX instructions start to get dense enough to justify a voltage swing upwards. This actually exists in Haswell as well - certain AVX2 instructions are designated "heavy" and if you get enough of them you'll enter a voltage or frequency transition.

On Skylake-X there are more states... AVX-512 light and heavy as well.

https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html#...

https://travisdowns.github.io/blog/2020/08/19/icl-avx512-fre...


Skylake-X on X299 can be configured to not downclock at all; the light and heavy stages are referred to in the BIOS as AVX2 and AVX512 offsets, but of course that comes with extra heat and power draw. The voltage transition period can't be mitigated AFAIK.


If I understood this correctly, this is for desktop chips and systems. My experience is solely based on Xeon family of processors.

Our systems don't feature a similar override, but we can adjust the thermal and power envelope of the processors and the system in general.

When we get a new bunch of systems, I’ll look into it, but my hopes are not that high.

Maybe we’ll get AMD systems this time, who knows.


> The biggest problem is not support for the instruction set in the silicon, but the performance penalty it brings.

Why is that sentence present tense instead of past tense? I suppose it continues to be a problem for Intel, but your comment appears to be presenting Intel downsides as if they were universal. Zen 4 apparently implements AVX-512 efficiently, without the problems Intel implementations experienced. That's what this whole discussion is about, and that's what Phoronix found as well.[0]

Hopefully Intel will catch up to AMD on AVX-512, but in the mean time, people optimizing software should be aware that AVX-512 has few (if any) downsides on certain platforms. Phoronix found zero performance penalty, but perhaps more testing is required.

[0]: https://www.phoronix.com/review/amd-zen4-avx512/6


Because it's hard to say "conditionally enable AVX-512 only on platforms supporting it BUT not on platforms where it actually brings performance penalties."


Can you give a real example of AVX-512 actually causing a net performance penalty on any CPU?

The only way I can see that happening is using AVX-512 in small, infrequently called functions such as strcmp, and the solution is: don't do that.

If proper SIMD code runs for say 1ms at a time, it's pretty much guaranteed to benefit from any implementation of AVX-512.


> The gains AVX512 brings over AVX2 are, for most people w/ specialty libs excluded

The last two things I worked on, image compression and quicksort, see 1.4-1.6x end to end speedups from AVX-512 vs AVX2. Is that sufficiently motivating? Especially because the only thing we had to do was ensure that CI machines are AVX-512 capable so that those test codepaths also run.

The "terrible CPU support" is a fact of life, not just in x86 (AES is 'optional' in SVE2, sigh), and so we deal with it via runtime dispatch - using what the CPU supports.


I don't see why performance-critical code wouldn't have an AVX512 implementation in addition to a scalar or SSE or AVX2 fallback, if AVX512 gives a big enough speed-up on a large enough number of relevant devices.


> This instruction is used in some bignum code

Could you be more specific? I think for that to work one would also need the upper half of 64x64 multiplication and `vpmullq` provides only the lower half. You could break one 64x64 multiplication into four 32x32 multiplications (i.e. emulate the full 64x64 = 128 bits multiplication) but I was under the impression that this was slow.


I assume that as you say, whoever used this instruction was using it for multiplying 32-bit numbers.

On AMD Zen 4 and Intel Cannon Lake or newer (when AVX-512 is supported), the fastest method to multiply big numbers is to use the IFMA instructions, which reuse the floating-point multipliers to generate 104-bit products of 52-bit numbers.


vpmullq is not that useful; in bignum code you also want the upper part of the product, and there is no corresponding vpmulhq instruction to get that.

On the other hand, vpmadd52luq and vpmadd52huq do give you access to the lower and upper parts of a 52x52->104 bit product, and those instructions perform well in the Intel chips, 3x faster than vpmullq.
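
A minimal sketch of the IFMA building block (my own example; the 64-bit lanes must hold values below 2^52):

    #include <immintrin.h>

    // accumulate the low and high 52 bits of eight 52x52-bit products into 64-bit lanes
    void madd52(__m512i a, __m512i b, __m512i *lo_acc, __m512i *hi_acc) {
        *lo_acc = _mm512_madd52lo_epu64(*lo_acc, a, b);  // vpmadd52luq
        *hi_acc = _mm512_madd52hi_epu64(*hi_acc, a, b);  // vpmadd52huq
    }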


Phoronix: AMD Zen 4 AVX-512 Performance Analysis On The Ryzen 9 7950X

https://www.phoronix.com/review/amd-zen4-avx512

"On average for the tested AVX-512 workloads, making use of the AVX-512 instructions led to around 59% higher performance compared to when artificially limiting the Ryzen 9 7950X to AVX2 / no-AVX512.

From these results I am rather impressed by the AVX-512 performance out of the AMD Ryzen 9 7950X. While initially being disappointed when hearing of their "double pumping" approach rather than going for a 512-bit data path, these benchmark results speak for themselves. For software that can effectively make use of AVX-512 (and compiled so), there is significant performance uplift to enjoy while no negative impact in terms of reduced CPU clock speeds / higher power consumption (with oneDNN being one of the only exceptions seen so far in terms of higher power draw).

AVX-512 is looking good on the Ryzen 7000 series and I'll continue running more benchmarks over the weeks ahead. These AVX-512 results make me all the more excited for AMD EPYC "Genoa" where AVX-512 can be a lot more widely-used among HPC/server workloads. "


I wonder how much of that 59% gain comes from the 512bit registers/instructions themselves, and how much comes from the new instructions and modes that come with AVX-512, and can still be used with the narrower 256bit and 128bit registers.

Would be interesting to modify some of the benchmarks to be limited to 256bit AVX-512 and see how they compare.


Mystical's report indicates much of it does come from the wider instructions, because they can saturate the core more easily. Zen 3 was front-end bottlenecked, so Zen 4 running AVX512 can more often hit 4x256. The new instructions are useful and some help perf, but mostly only for pretty specialized stuff. Masking is nice, but I think people really exaggerate the improvement from it; vblend was only 2 cycles.
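
For what it's worth, the difference in source form (a sketch of my own; AVX-512VL lets the mask registers be used at 256-bit width):

    #include <immintrin.h>

    // AVX2: the select needs a separate vector mask and an explicit vblendvps
    __m256 add_where_avx2(__m256 src, __m256 vmask, __m256 a, __m256 b) {
        return _mm256_blendv_ps(src, _mm256_add_ps(a, b), vmask);
    }

    // AVX-512VL: the k-register mask is folded into the add itself
    __m256 add_where_avx512vl(__m256 src, __mmask8 k, __m256 a, __m256 b) {
        return _mm256_mask_add_ps(src, k, a, b);   // lanes with k=0 keep src
    }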


> And it is basically impossible to hit the 230W power limit at stock without direct die or sub-ambient.

Almost, but not quite. In GamersNexus' review they recorded 250.8W measured at the EPS12V cables, while using an Arctic Cooling Liquid Freezer II 360mm AIO with the fans at 100%. At 230W/1.5V=153A a good VRM will generate about 17W of heat. That leaves you a few watts for board power plane and socket resistive losses (I don't have an estimate for that).

Not a very practical cooling solution for a day-to-day workstation, but I do wonder if you could reduce the fan speeds a bit while still maxing out the power limit.


Worth noting that GN measured the power before the VRMs, while the limit is applied after the VRMs. Assuming a 90% efficiency, what GN measured would be 225.7W at the socket. Close, but still not quite.


I accounted for VRM efficiency losses in the second sentence, using data from a real X570 VRM.


What if you assume 91.7% efficiency?


Then again, I have 2 420mm Black Ice Nemesis radiators in my custom loop - even at relatively low speeds it can keep the 5800X in there and 3080 Ti cool under constant high loads.


My mini-ITX work desktop has no problems with a 5900x and a Radeon VII pro running Rocm work, using only a tiny heatsink on the 5900x (and some high-airflow fans, but nothing too incredibly loud). It doesn't thermal throttle, but tops out around 80-90 degrees C.

The 7000-series seems to be a different story: you really need a big cooler for those chips.


I have the same CPU + GPU combination, but used on an ATX MB with a Noctua cooler with a double 120-mm fan.

While the larger case and cooler make the cooling easier, the fans are normally inaudible and the CPU stays under 45 degrees Celsius when not doing heavy work, and the temperature may rise to a little over 60 degrees Celsius when 100% busy.

From what I have seen until now, cooling will no longer be so easy for the 7000 series, unless you choose to run them in the Eco mode.


What computer case/chassis are you using that fits those radiators? Was it an easy build?


Corsair 7000D Airflow - it's a full tower.

Easy build, the case has a lot of room to route. I went with ZMT tubing rather than hard tubing, as I didn't want to deal with rigid fittings, I prefer the aesthetic, and it's easier for routing when you can have things other than right angles!

One important note - the front rad is an X-Flow rather than traditional U-Flow.


Amusingly, there still seems to be 70% of the performance when limiting the power to 65W.

This means the default power limits are not reasonable, and only there to win the release day benchmarks.


You can already do something like this if you aren't afraid to go deep into bios settings.

Not every computer is going to have great cooling. Small cases won't be able to cool the full power and will end up heating up and throttling, which will produce hiccups in the GUI.


Yes, PPT is the setting, expressed in watts.

This definitely works on Zen2 onward, unsure about Zen/Zen+.


I feel like AMD having their "eco mode" is a pretty good solution - it allows the user a pretty easy method of deciding if peak performance or efficiency matters more to them. Better than a separate SKU for each.

Though this comes at the expense of reviews possibly judging things based on the "incorrect" setting.


It seems that the return of competition has led to increasing power draw. At least in the enthusiast market where, it seems, customers don't care much.


I care more about my time than I do about my power usage.


Haha, as someone who has been shouting "no, really, AVX-512 is good, even if it's double-pumped, just wait for it guys" into the void for years now, glad to see it finally hit the desktop for real and that the AVX people are already leaning into it.

Years and years of "nobody needs AVX-512" and "linus says it's just for benchmarks, he worked at transmeta two decades ago, he knows better than Lisa Su" hot takes down the tubes ;)


What not many people realize is that recent glibc brought AVX-512-optimized str* and mem* functions to the ifunc dispatch table; your C code may have been using fancy mask registers on someone's Intel laptop!


> For all practical purposes, under a suitable all-core load, the 7950X will be running at 95C Tj.Max all the time. If you throw a bigger cooler on it, it will boost to higher clock speeds to get right back to Tj.Max. Because of this, the performance of the chip is dependent on the cooling. And it is basically impossible to hit the 230W power limit at stock without direct die or sub-ambient.

> If 95C sounds scary, wait to you see the voltages involved. AMD advertises 5.7 GHz. In reality, a slightly higher value of 5.75 GHz seems to be the norm - often across half the cores simultaneously. So it's not just a single core load. The Fmax is 5.85 GHz, but I have never personally seen it go above 5.75.

5.75 GHz is reached with 1.5 V Vcore.

The +50 MHz bump over advertised boost clocks was also present in Zen 3, likely in response to the poor reception of Zen 2 behavior, which would usually fail to achieve the advertised clocks.


I'm genuinely curious about the details of how the 1.5V Vcore measurement was obtained. CPU-Z and software measurements in general don't have the greatest reputation for accuracy, especially with just-released generations of CPUs. Conventional wisdom has been that with newer manufacturing processes, less voltage is required (and tolerated), and 1.5V Vcore sounds truly insane in 2022 for a "4nm" chip. For reference, I haven't heard of 1.5V being a safe "24/7" voltage since the days of 90nm-130nm+ CPUs circa 2005-2006. IIRC casual overclockers in the forums weren't really comfortable with 1.5V even with 65nm Core 2, and this was back when it was common to, e.g., safely overclock your 2.4 GHz Core 2 Quad to 3.4 GHz.


1.5V was the normal operating voltage of the 130nm "Northwood" Pentium 4s. For 65nm processors it's usually around 1.3-1.4. The problem with high voltages is that while the CPU may appear to work fine for a while, it may start becoming unstable and then suddenly die.

Part of me wonders whether, after a very long period of being conservative with lifetime, seeing their products last longer than they would like, and facing diminishing increases in performance with each new model, the CPU manufacturers decided to go all-out with voltages that are guaranteed to cause failure, as long as they could make almost all of the failures happen just outside of warranty; not unlike what the LED lighting market has done.


Probably uses the same registers as previous generations.

Would be simple to confirm by probing the CPU power with a scope.


The problem is the CPU itself isn't the one measuring voltage, it gets that information from the motherboard's VRM controller. The accuracy of the reported value can vary depending on the controller, how it's configured by the motherboard's firmware, and the physical circuit design.

That being said, with new motherboards generally using fully digital VRM controllers the reported value should be pretty close in most cases.


AMD has worked to make sure that everything lines up for this release with the tools people use. Their Ryzen Master (pretty good tool for tweaking ryzen settings) got a bit of a rework for zen4 as well.


Exciting stuff. AVX512 isn't just for specialized work projects. It's also a huge performance boost for game console emulation.


When I was doing some work on Dolphin's JIT, AVX implementations were always in the back of my mind. It's a massive tradeoff in so many cases, but having access to these is amazing.


That sounds like a specialized project :)


Any project becomes specialized if you work long enough on it.


Interesting! Any reason why they're specifically good for emulation?



I don't have a deep understanding of the implementation, but it gets you a 30% performance boost in Playstation 3 emulation.

https://www.tomshardware.com/news/ps3-emulator-avx-512-30-pe...


The BF16 and VNNI instructions are finally going to make AMD competitive for neural network inference.


Competitive with what? GPUs? Only for networks that fit in a cache and even then only in latency not in throughput. Still valuable but not competitive in the majority of use cases.


AMD EPYC Genoa will be a killing machine, with almost 1 TByte/sec memory bandwidth and this AVX512 extension...

Good luck to Intel's Xeon.


I've updated most of my home lab to AMD EPYC Rome processors. Really can't beat the core counts for private cloud and the price is amazing compared to Intel. Looking forward to Genoa myself, though moving past Rome will be a ways away for my lab.


Sounds like a hell of a lab! What are you doing on it, machine learning?


Ceph, OpenStack and about 240TiB raw of "Linux ISOs".


Genuinely good luck, it's never good when there is no competition


Yes. There needs to be competition or we'll go back to the days of Intel just stalling.

It's why I want Intel Arc to be decent, there needs to be more players in the GPU space.


I'm hearing 12 channels @ 5200 MT/sec or so. Sounds like 500GB/sec, not 1TB/sec. Oh maybe you meant in a dual socket config?


Very interesting read. The author notes that double pumping the 512 bit instructions to 256 bit execution units appears to be a good trade-off.

As far as I understand, ARM's new SIMD instruction set is able to map to execution units of arbitrary width. So it sounds to me like ARM is ahead of x86 in flexibility here and might be able to profit in the future.

Maybe somebody with more in-depth knowledge could respond whether my understanding is correct.


With any traditional ISA with wide registers and instructions, a.k.a. SIMD instructions, it is possible to implement the execution units with any width desired, regardless of the architectural register and instruction width.

Obviously, it only makes sense for the width of the execution units to be a divisor of the architectural width, otherwise they would not be used efficiently.

Thus it is possible to choose various compromises between the cost and the performance of the execution units.

However, if the ISA specifies e.g. 32 512-bit registers, then even the cheapest implementation must include at least that amount of physical registers, even if the execution units may be much narrower.

What is new in the ARM SVE/SVE2 and which gives the name "Scalable" to that vector extension, is that here the register width is not fixed by the ISA, but it may be different between implementations.

Thus a cheap smartphone CPU may have 128-bit registers, while an expensive server CPU for scientific computation applications might have 1024-bit registers.

With SVE/SVE2, it is possible to write a program without knowing which will be the width of the registers on the target CPU.

Nevertheless, the scalability feature is not perfect, so some programs may still be made faster by assuming a certain register width before compilation, which may make them run slower than possible on a CPU that in fact has wider registers than assumed.
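
To make the "scalable" part concrete, here is a hedged sketch (my own) of a vector-length-agnostic loop using the ACLE intrinsics; the same binary works whether the hardware has 128-bit or 2048-bit vectors:

    #include <arm_sve.h>
    #include <stdint.h>

    void add_arrays(float *c, const float *a, const float *b, int64_t n) {
        for (int64_t i = 0; i < n; i += svcntw()) {        // svcntw() = 32-bit lanes per vector
            svbool_t pg = svwhilelt_b32_s64(i, n);         // predicate also covers the tail
            svfloat32_t va = svld1_f32(pg, a + i);
            svfloat32_t vb = svld1_f32(pg, b + i);
            svst1_f32(pg, c + i, svadd_f32_x(pg, va, vb));
        }
    }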


ARM's SVE is definitely interesting, but I do wonder if it is slowly honing in on Cray-style vector processing. Which is definitely a cool idea, but a little different from the now-popular fixed-width SIMD. I don't know that it makes sense to call one ahead of the other yet -- ARM's documentation is clear that SVE2 doesn't replace NEON. "Mostly scalar but let's sprinkle in some SIMD" coding will probably always be with us (until ML somehow turns all programs into dot products, I guess!)

RISC-V also has a variable length vector extension.


There's not really any reason for modern general-purpose CPUs to specialize for IPC lower than 1 like what Cray did. CPUs need wide frontends to execute existing scalar code as fast as we're used to, and if you're not reusing most of that width for vectors then the design is just wasting power.


The key difference is that on SVE, the hardware specifies the vector width, whilst with AVX, the programmer specifies the vector width. Of course, the actual hardware EU doesn't have to match either of these.

There are various benefits and drawbacks to each approach.


This reminds me of the Z-80, which replaced the 8080's 8-bit ALU with a 4-bit unit. Evidently FF recognized that it had no need to produce an answer in one cycle, because it would not be needed so soon, so why waste logic trying?

At the time, a 4MHz Z-80 ran about as fast as a 1MHz 6502, so that suggests opposite ends of that tradeoff range. 6502 had to do all its work in a cycle, but cycles were longer, so there was more time to do it in. Z-80 got half the work done, then the other half, with two more cycles left over for whatever.

This is what AVX2 used to do, too, working on 128 bits at a time. Probably when 3nm comes along, they will make it 512 bits wide if they can't think of and justify any better way to use up their area budget.


He posted benchmarks on a separate thread: https://www.mersenneforum.org/showthread.php?t=28107





