
The biggest problem is not support for the instruction set in the silicon, but the performance penalty it brings.

SIMD hardware is the most power hungry block on Intel CPUs, and the frequency penalty it brings is never completely disclosed in the tech docs. Sometimes Intel doesn't share that information even with you as a serious customer.

In the HPC world, no instruction is too obscure or niche to use. However, when you use these instructions too frequently, the heat load it generates can slow you down instead of accelerating you over the course of your job, so AVX512 is a pretty mixed case in Intel CPUs.

Regardless of this penalty, numeric code benefits from wider SIMD pipelines in most cases. At worst, you see no speedup, but you're investing for the future.

On the other hand, we have seen applications which run faster on previous generation hardware due to over-optimization.



> However, when you use these instructions too frequently, the heat load it generates can slow you down

It's not the heat load that slows you down. If you are using them enough that you produce enough heat that you have to downclock, it's still a win because the instructions improved your throughput more than what you lost in clocks.

The problem with Intel's initial AVX-512 implementation was that they didn't clock down because of heat, they clocked down pre-emptively and substantially whenever the CPU executed even a single AVX-512 instruction, even if there was no added heat load, and stayed on the lower clocks for a long period. This worked fine for any proper SIMD loads, but was crushing in any situation where there was just a handful of AVX-512 ops between long stretches of scalar code, such as using an AVX-512 optimized version of some library function.
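To make the trade-off in this comment concrete, here is a rough back-of-the-envelope sketch. It assumes the core stays at the reduced clock for the whole run, and every number in it (a 2x SIMD speedup, a clock ratio of 0.85, the 90% and 1% vectorized fractions) is a made-up illustration, not a measurement of any real CPU. The function name `runtime_seconds` is hypothetical.

```python
def runtime_seconds(scalar_seconds, simd_fraction, simd_speedup, clock_ratio):
    """Estimate runtime when part of the work is vectorized and the whole
    run executes at a reduced clock.

    scalar_seconds: runtime of the all-scalar version at full clock
    simd_fraction:  share of that runtime that gets vectorized (0..1)
    simd_speedup:   per-clock speedup of the vectorized part (assumed)
    clock_ratio:    reduced clock divided by full clock (assumed)
    """
    if not 0.0 <= simd_fraction <= 1.0:
        raise ValueError("simd_fraction must be between 0 and 1")
    if simd_speedup <= 0 or clock_ratio <= 0:
        raise ValueError("simd_speedup and clock_ratio must be positive")
    vector_part = scalar_seconds * simd_fraction / simd_speedup
    scalar_part = scalar_seconds * (1.0 - simd_fraction)
    # Both parts pay the clock penalty, because the core stays downclocked.
    return (vector_part + scalar_part) / clock_ratio


# Hypothetical workloads, each taking 1.0 s as pure scalar code at full clock.
dense = runtime_seconds(1.0, simd_fraction=0.90, simd_speedup=2.0, clock_ratio=0.85)
sparse = runtime_seconds(1.0, simd_fraction=0.01, simd_speedup=2.0, clock_ratio=0.85)
print(f"dense SIMD:  {dense:.3f} s")
print(f"sparse SIMD: {sparse:.3f} s")
```

Under these assumed numbers the dense workload finishes well under the 1.0 s scalar baseline, while the sparse one ends up slower than the baseline, because nearly all of its time is scalar code running at the lower clock.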


> [T]hey clocked down pre-emptively and substantially whenever the CPU executed even a single AVX-512 instruction...

Because you were hitting the power envelope limits in the CPU in these cases too. You might not see the heat, but the CPU cannot carry the power required to keep that core at non-AVX speeds with these power-hungry blocks operated at full speed.

As I said, to add insult to injury, Intel didn't share the exact details of its AVX implementations and the frequency ranges they operate in, either.

Ah, publicly sharing your findings is/was forbidden too.


No, as the above poster said, Intel slows down the CPU before any actual increase in power consumption or temperature occurs, because they fear that their power limit and temperature controller will not be able to react fast enough when the power increase eventually happens.

Whatever control mechanism is used in the AMD Zen CPUs is better than Intel's, so they downclock only when the power consumption really increases, and the clock frequency recovers when the power consumption decreases. As a result, there is no penalty for sporadically using a few 512-bit instructions, as there is on the Intel CPUs.


No, actually grandparent is correct here - Skylake-X/Skylake-SP have never clocked down the first time they see an AVX-512 instruction. The downclock actually happens when the AVX instructions start to get dense enough to justify a voltage swing upwards. This exists in Haswell as well - certain AVX2 instructions are designated "heavy" and if you get enough of them you'll enter a voltage or frequency transition.

On Skylake-X there are more states... AVX-512 light and heavy as well.

https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html#...

https://travisdowns.github.io/blog/2020/08/19/icl-avx512-fre...


Skylake-X on X299 can be configured to not downclock at all; the light and heavy stages are referred to in the BIOS as AVX2 and AVX512 offsets. Of course, that comes with extra heat and power draw. The voltage transition period can't be mitigated AFAIK.


If I understood this correctly, this is for desktop chips and systems. My experience is based solely on the Xeon family of processors.

Our systems don’t feature a similar override, but we can adjust the thermal and power envelope of the processors and the system in general.

When we get a new bunch of systems, I’ll look into it, but my hopes are not that high.

Maybe we’ll get AMD systems this time, who knows.


> The biggest problem is not support for the instruction set in the silicon, but the performance penalty it brings.

Why is that sentence present tense instead of past tense? I suppose it continues to be a problem for Intel, but your comment appears to be presenting Intel downsides as if they were universal. Zen 4 apparently implements AVX-512 efficiently, without the problems Intel implementations experienced. That's what this whole discussion is about, and that's what Phoronix found as well.[0]

Hopefully Intel will catch up to AMD on AVX-512, but in the meantime, people optimizing software should be aware that AVX-512 has few (if any) downsides on certain platforms. Phoronix found zero performance penalty, but perhaps more testing is required.

[0]: https://www.phoronix.com/review/amd-zen4-avx512/6


Because it's hard to say "conditionally enable AVX-512 only on platforms supporting it BUT not on platforms where it actually brings performance penalties."
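For what it's worth, the policy described here could be sketched roughly as below. Everything in it is an assumption for illustration: the `CpuInfo` type, the `pick_kernel` function, and especially the `SLOW_AVX512` denylist, which a real project would have to populate from its own benchmarks. The single entry shown (Intel family 6, model 85, i.e. Skylake-X/SP) is only a placeholder based on the CPUs discussed in this thread, not a measured recommendation. The flag names follow the Linux `/proc/cpuinfo` spelling.

```python
from typing import NamedTuple


class CpuInfo(NamedTuple):
    """Hypothetical description of the CPU the program is running on."""
    vendor: str
    family: int
    model: int
    flags: frozenset


# Hypothetical denylist of (vendor, family, model) where AVX-512 is assumed
# to cost more than it gains. Placeholder entry, not benchmark data.
SLOW_AVX512 = {("GenuineIntel", 6, 85)}


def pick_kernel(cpu: CpuInfo) -> str:
    """Return the name of the code path to use: 'avx512', 'avx2' or 'scalar'."""
    fallback = "avx2" if "avx2" in cpu.flags else "scalar"
    if "avx512f" not in cpu.flags:
        return fallback  # not supported in silicon
    if (cpu.vendor, cpu.family, cpu.model) in SLOW_AVX512:
        return fallback  # supported, but assumed to carry a net penalty
    return "avx512"


skylake_sp = CpuInfo("GenuineIntel", 6, 85, frozenset({"avx2", "avx512f"}))
print(pick_kernel(skylake_sp))  # takes the denylist branch
```

The hard part the comment points at is not this lookup but keeping the denylist accurate, since the penalty depends on the workload as much as on the CPU model.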


Can you give a real example of AVX-512 actually causing a net performance penalty on any CPU?

The only way I can see that happening is using AVX-512 in small, infrequently called functions such as strcmp, and the solution is: don't do that.

If proper SIMD code runs for, say, 1 ms at a time, it's pretty much guaranteed to benefit from any implementation of AVX-512.



