
I am a happy owner of a Tiger Lake (Intel 11th Gen) Framework laptop. I've considered upgrading to a 12th or 13th Gen motherboard, and while I have no doubt they'd be great for me as a Gentoo developer given the greatly increased core counts, my hesitation is that the new CPUs have AVX-512 disabled.

Maybe this doesn't matter (it almost certainly wouldn't for most people), but I'm compiling the whole system myself, so the compiler at least has the freedom to use AVX-512 wherever it pleases. Does anyone know if AVX-512 actually makes a difference in workloads that aren't specifically tuned for it?

My guess, given news like https://www.phoronix.com/news/GCC-AVX-512-Fully-Masked-Vecto..., is that compilers basically don't do anything interesting with AVX-512 without hand-written code.



The promise of the AVX-512 instruction set really was that it would be much easier to (auto-)vectorize code that wasn’t written with vectorization in mind, with tools like masked execution and gather/scatter that either didn’t exist at all before (SSE) or were very minimal (AVX).

The tools are there in the instruction set, but that still leaves the time and effort needed to implement them in compilers, and there has to be enough performance improvement on enough machines in some market (browsers, games, etc.) before any of this potential becomes real.

The Skylake-Xeon/Ice Lake false start here really can't have helped. It's still much more pragmatic to target the Haswell feature set that all the Intel chips and most AMD chips can run (and run well).
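
As a hypothetical illustration of what masked execution buys you, a conditional loop like this used to force either scalar code or explicit blend sequences, but maps directly onto AVX-512 mask registers:

    /* Sketch, not from the thread: the compare produces a k-mask and
       the update becomes a masked load/add/store, so the vector body
       needs no branch and no blend sequence. */
    void conditional_add(float *dst, const float *src, int n) {
        for (int i = 0; i < n; i++) {
            if (src[i] > 0.0f)
                dst[i] += src[i];
        }
    }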


Funny that if you want AVX-512 now, it's AMD that's offering it and Intel that isn't.

Sometimes the second comer to a game has the advantage of taking their time to implement something, with fewer compromises and a better overall fit.


The compiler will only choose to use AVX-512 if you give it the right `-m` flags. Most people run generic distros that target the baseline x86-64 (K8) instructions, so they benefit from AVX-512 only when some library has runtime dispatch that detects the presence of the feature and enables optimized routines. This is common in, for example, cryptography libraries.
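
For illustration, a minimal sketch of that dispatch pattern using GCC/clang's `__builtin_cpu_supports` (the sum_* function names are made up):

    /* The AVX-512 variant lives in a separate file built with
       -mavx512f; this wrapper queries CPUID at runtime and picks
       the best routine. */
    void sum_avx512(const float *x, int n, float *out);
    void sum_generic(const float *x, int n, float *out);

    void sum(const float *x, int n, float *out) {
        if (__builtin_cpu_supports("avx512f"))
            sum_avx512(x, n, out);
        else
            sum_generic(x, n, out);
    }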


Right. Since I'm using Gentoo and compiling my whole system with `-march=tigerlake`, the compiler is free to use AVX-512.

My question is just... does it? (And does it use AVX-512 profitably?)


It will not use AVX-512 if you have CFLAGS="-march=tigerlake -O2". You will, at the very least, need CFLAGS="-march=tigerlake -O3" to get it to use even AVX2, and Tiger Lake's AVX-512 implementation is poor enough (clock throttling, etc.) that gcc will not emit AVX-512 for it. AVX-512 is used with -march=znver4, though, so the support for autovectorizing to AVX-512 is clearly there.

https://godbolt.org/z/1a39Mf3bv
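
For anyone not clicking through, a sketch of the kind of test case that shows this (the loop is hypothetical, and exact behavior varies by GCC version):

    /* Try in a compiler explorer:
         gcc -O2 -march=tigerlake  -> scalar (little autovectorization at -O2 in older GCC)
         gcc -O3 -march=tigerlake  -> AVX2, 256-bit ymm registers
         gcc -O3 -march=znver4     -> AVX-512, 512-bit zmm registers */
    void saxpy(float *y, const float *x, float a, int n) {
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    }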


Is it actually that bad on Tiger Lake? Or just for really wide vectors? On my old Ice Lake laptop, single-core AVX-512 workloads don't decrease frequency at all even with the wider registers, and multi-core workloads result in only a small clock-speed drop, maybe 100 MHz or so.

It depends on a couple of factors (e.g. client Ice Lake only has one FMA unit), but I'd be surprised if Tiger Lake were a major regression relative to Ice Lake. It seems like they had it in an OK spot by then.


In my experience it depends on the compiler. clang seems far more willing to autovectorise than gcc. Also, you have to write the code in a way that strongly hints to the compiler that it can be autovectorised. So lots of handholding.


I guess a better question is why you rebuild the system without a rational basis to expect benefits.


Are you familiar with source-based distributions?

I'm not rebuilding specifically for this one potential optimization.


Why not use -march=native?


Surprisingly, -march=native doesn’t always expand to the locally optimal build flags we might expect, particularly with gcc on non-Linux platforms.
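
For reference, one way to inspect what it actually expanded to is a standard GCC query:

    gcc -march=native -Q --help=target

which prints the concrete -march/-mtune values and the enabled feature flags the driver chose.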


Oh interesting. Is this one of those things where backwards compatibility eventually got in the way of the intended purpose?


I actually do. I just said -march=tigerlake to make it clear what CPU family the compiler was targeting.


Why not use -march=snark?


> I've considered upgrading to a 12th or 13th Gen motherboard, and while I have no doubt they'd be great for me as a Gentoo developer with the greatly increased core counts, my hesitation is that the new CPUs have AVX-512 disabled.

Unless you have a very specific AVX-512 workload or you need to run AVX-512 code for local testing, you won't see any net benefit from keeping your older AVX-512 part.

Newer parts will have higher clock speeds and better performance that will benefit you everywhere. Skipping that for the possibility of some future workload where AVX-512 might help is a net loss.


Now you could choose a new AMD Phoenix-based laptop with great AVX-512 support (e.g. one with a Ryzen 7 7840HS, Ryzen 9 7940HS, or Ryzen 7 7840U).

AMD Phoenix is far better than any current Intel mobile CPU anyway, so it is an easy choice (and it compiles code much faster than Intel Raptor Lake, which matters for a Gentoo user or developer).

The only reason not to choose an AMD Phoenix for an upgrade would be to wait for Intel Meteor Lake, a.k.a. Intel Core Ultra. Meteor Lake will be faster single-threaded (relative multi-threaded performance is unknown) and it will have a bigger GPU (1024 FP32 ALUs vs. 768 for AMD).

However, Meteor Lake will not have AVX-512 support.

For compiling code itself, AVX-512 support should not matter, but it should matter a lot for the code generated by the compiler, as it enables efficient auto-vectorization of many loops that cannot be vectorized efficiently with AVX2.

While gcc and clang will never be as smart as hand-written vector code, their automatic use of AVX-512 can be improved a lot, and announcements like the one you linked show progress in this direction.
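
A hypothetical example of such a loop: indirect stores have no AVX2 equivalent (AVX2 added gathers but no scatter), while AVX-512 provides vscatterdps:

    /* Sketch: with AVX2 the stores through idx[] stay scalar; with
       AVX-512 the loop can in principle be vectorized with
       vscatterdps (plus conflict detection if indices may repeat). */
    void scatter_copy(float *dst, const float *src, const int *idx, int n) {
        for (int i = 0; i < n; i++)
            dst[idx[i]] = src[i];
    }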


> Does anyone know if AVX-512 actually makes a difference in workloads that aren't specifically tuned for it?

I know game console emulators use it to great effect, with significant performance increases.


Incidentally, that's another case where the 512-bit-ness is the least interesting part. The new instructions are useful for efficiently emulating ARM NEON (Switch) and Cell SPU (PlayStation 3) code, but those platforms are themselves only 128 bits wide, so I don't believe the emulators have any use for the 512-bit (or even 256-bit?) variants of the AVX-512 instructions.
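
As a concrete (hypothetical) sketch of using the new instructions at 128-bit width: AVX-512VL exposes vpternlogd on xmm registers, which fuses the three-op bitwise select that NEON/SPU emulation does constantly into a single instruction:

    #include <immintrin.h>

    /* mask ? a : b.  With SSE this is and/andnot/or (three ops);
       with AVX-512VL it is one vpternlogd on a 128-bit register.
       0xCA is the truth table for "first ? second : third".
       Requires -mavx512f -mavx512vl.  Not actual RPCS3 code. */
    static inline __m128i bitselect128(__m128i mask, __m128i a, __m128i b) {
        return _mm_ternarylogic_epi32(mask, a, b, 0xCA);
    }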


I haven't looked into the code for these, but are they possibly pipelining multiple ops per clock? If the work isn't dependency-chained, they can probably compute a few cycles' worth at once.


Specifically, RPCS3 got a huge speedup from AVX-512 [1]

1: https://www.tomshardware.com/news/ps3-emulation-i9-12900k-vs...


RPCS3 is a big fan of esoteric CPU features; it was also one of the very few applications that used Intel's TSX before Intel killed it off.


Game console emulators are of course specifically tuned for this.


What other emulators besides RPCS3 use it?


AVX-512 is specifically the first x86 vector extension for which compilers should eventually be able to emit reasonable code. Thanks to gather and masked execution, vectorizing a simple loop with AVX-512 doesn't always mean blowing up code size by 10x.

However, compilers have so far been slow to implement this, with the relevant patches only landing in GCC right now.
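
To make "fully masked" concrete: the goal of that work is for even the loop remainder to run under a mask instead of falling back to a scalar tail, so a plain loop like this (hypothetical example) can compile to a single masked vector body:

    /* With fully-masked vectorization (per the linked GCC work), the
       final iterations where n isn't a multiple of the vector width
       execute under a k-mask instead of in a scalar epilogue loop. */
    void add_arrays(float *dst, const float *a, const float *b, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = a[i] + b[i];
    }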



