
So, in about 30 years, when the majority of CPUs have this, we can use it. Assuming Intel does not gate it to Xeon for no reason whatsoever, like they did with AVX512?


Performance-intensive code (hot loops) can be compiled once per architecture extension, with the right version selected at runtime via CPUID. That's how all the various SSE and AVX extensions were rolled out in multimedia code.

AFAIK AVX512 was present in higher-end desktop SKUs, but not in the latest E-core designs in Intel's client chips. The main problems are that:

- Operations at 512-bit register widths have significant power draw. Intel chips that support AVX512 have to downclock themselves on AVX512 workloads until their voltage regulators have boosted up to a higher voltage.

- The AVX512 register file is too big to physically fit in the E-core[0] footprint.

Incidentally, I do remember Linus Torvalds specifically complaining that AVX512 was being used to implement memcpy in gcc, because it meant running certain programs would lower system performance. So these new extensions tend to see use long before it is safe to make them your minimum compile target.

[0] The BIOS on my Framework laptop refers to these as "Atom cores" - no clue if the current E-core design is derived from Atom or if this is a miscommunication or nickname AMI picked.


No, operations at 512-bit register widths do not have significant power draw, as demonstrated by AMD Zen 4.

What has significant power draw is the use of double 512-bit floating-point multipliers, as implemented in the Intel server CPUs (though one core with such multipliers draws significantly less power than two cores having the same throughput).

AMD uses only double 256-bit floating-point multipliers and in general it uses exactly the same execution units for both 256-bit and 512-bit operations, so the AVX-512 operations do not increase the power draw even when they use 512-bit registers.

Also, the AVX512 register file is not too big to physically fit in the E-core. Even if the AVX512 architectural register file is 4 times larger than the AVX register file, the E-cores have a much larger physical register file used to rename the architecturally visible registers.

Despite these facts, Intel still believes that implementing the full AVX-512 ISA in the E-cores is too expensive, so they have created this new specification of AVX10/256, which is just a subset of AVX-512 including the instructions with an operand size up to 256 bits, and which will be implemented in all future E-cores after some date, perhaps starting in 2025.


Yes, E-cores are Atom cores. You can tell because they have -mont in the codename.


Moreover, besides their history, the E-cores continue to be sold using the Atom brand, for instance "Intel Atom® x7425E" (the industrial variant of the Intel N100, one of the Alder Lake N models).


Modern compilers already allow for conditional code execution depending on CPU feature sets, and at least as far as JVM implementations and the CLR are concerned, their JITs are clever enough to already use parts of AVX512. While they are not perfect, it is better than not using them at all.


"Tiered" x86 packages and executables seem like the inevitable direction for linux distros.

CachyOS and Clear Linux already do this.

Base Arch Linux and openSUSE are working on it. Maybe Fedora too, but I can't remember.

And it wouldn't be totally insane for Windows to do this either.


This is Gentoo's whole shtick. It's the core feature. It's been supported since day 0, over 20 years ago.

I'm honestly a little amazed that more people don't either use Gentoo or adopt its model, given the smorgasbord of mutually incompatible instruction set extensions. It seems very strange that people will buy a CPU that has 32 64-byte ZMM registers with 3 operand instructions, and then use that CPU to run code that operates on 8 16-byte XMM registers with 2 operand instructions.
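For reference, Gentoo's model largely boils down to a couple of lines in /etc/portage/make.conf (a typical configuration, not the only possible one):

```sh
# /etc/portage/make.conf
# -march=native tunes every package for the exact CPU doing the build,
# so all the ISA extensions the chip reports actually get used.
COMMON_FLAGS="-march=native -O2 -pipe"
CFLAGS="${COMMON_FLAGS}"
CXXFLAGS="${COMMON_FLAGS}"
```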


Because building from source is extremely time/CPU consuming and also unreliable. Time is valuable. In my last Gentoo attempt, I had to manually fix a few build recipes before I threw in the towel.

This also means "riskier" methods (like LTO) have to be omitted by default.

Gentoo is great for libre software, security, manual patches, embedded computing and such. But for pure desktop performance, the Clear Linux way is best: aggressive compilation flags/libraries, tested by the package maintainers, shipped in 3-4 tiers. And as the Clear Linux devs said, most of the native instructions don't even matter, as the compilers can't use them.


Because most people have better shit to do than fix their now-broken system every other time they upgrade their packages. I used to run Gentoo, was told on IRC after like the 9,000th such breakage to go use something else if I didn't like their perpetually broken free distribution, and I took their advice.


The speedup for most code is really small and things like codecs that really benefit from AVX will detect and use it at runtime.


How does that work? The binary format embeds variants of the same program?


Yes, here is an example of how it works in GCC.

https://gcc.gnu.org/onlinedocs/gcc-13.1.0/gcc/Function-Multi...


On linux distros, the package manager downloads different binaries based on your CPU. Skylake would be x86-64-v3, Zen 4 would be x86-64-v4, for example.

And there are different schemes for multiple architectures in the same program, like hwcaps.


Isn’t this going to get very unmanageable very soon? Intel seems to add extensions every other year or so.


The extensions can be kinda broken down into 4 levels. Basically ancient, old (SSE 4.2), reasonably new (AVX2, Haswell/Zen 1 and up), and baseline AVX512.

https://developers.redhat.com/blog/2021/01/05/building-red-h...

There is discussion of a fifth level. Someone in the Intel Clear Linux IRC said a fifth level wasn't "worth it" for Sapphire Rapids because most of the new AVX512 extensions were not autovectorized by compilers, but that a new level would be needed in the future. Perhaps they were thinking of APX, but couldn't disclose it.


AVX10/APX does sound like a good baseline for v5.


except that it doesn't support full AVX-512, making the whole idea of backward compatibility between these levels meaningless. "It's Intel!!!"


Well that's an even better justification, as a x86-64-v5 level would be needed for the newer CPUs.

We can throw away any hope of v4 being a standard baseline.


It’s easy to fully automate and storage is relatively cheap these days.


I'd think the issue would be more build infra: every new variant means you have to build the world again.


Again, compute is surprisingly cheap these days.

Work out what it would cost to compile - say - a terabyte of C code at typical cloud spot prices.

A large VM with 128 cores can compile the 100 MB Linux kernel source tree in about 30 seconds. So… 200 MB/minute or 12 GB/hour. This would take 80 hours for a terabyte.

A 120 core AMD server is about 50c per hour on Azure (Linux spot pricing).

So… about $40 to compile an entire distro. Not exactly breaking the bank.


You'd have to separate out compiling and linking at a bare minimum to get even a semi-accurate model. Plus a lot of userspace is C++, which is much, much slower to compile.


Yes. Also, test it.


That can also be largely automated.


LTO does occasionally break things in hard-to-detect ways, but I have never heard of a -march x86 compilation bug.


In the end it will be like any other modern hardware appliance:

the hardware is the same design for cost-saving purposes, but different features are unlocked for $$$ by a software license key.

You want AVX-512? Pay up to unlock the feature in your CPU and you can now use it. This could also enable a pay-as-you-go license scheme for CPUs, creating recurring revenue for Intel.

From the hardware perspective: the same silicon, but different features sold separately.


Maybe JIT compilers can take advantage of this immediately, since they target a single machine?


Yup. It's one of their theoretical advantages that's about to become a lot less theoretical. Historically it hasn't made much difference because optional instructions were hard for JIT compilers for most languages to use (in particular high level JITd languages tend not to support vector instructions very well). But a doubling of registers is the sort of extension that any kind of code can immediately profit from.

Arguably it will be only JITd languages that benefit from this for quite a while. These sorts of fundamental changes are basically a new ISA and the infrastructure isn't really geared up to make doing that easy. Everyone would have to provide two versions of every app and shared library to get the most benefit, maybe even you get combinatorial complexity if people want to upgrade the inter-library calling conventions too. For native AOT compiled code it's going to just be a mess.


As far as the JVM, ART, and the CLR are concerned, it is quite practical, even if there is room for improvement.


Gentoo users will finally get to be smug again, once GCC/clang have support for them.


All the more reason that Wasm should be the bottom of software :)


IBM and Burroughs/Unisys have already been doing that for decades, with bytecode-based executables for their mainframes/micros.

Or Xerox PARC, with their microcoded CPUs loading the desired interpreter on boot.

I guess it is an idea that keeps being revalidated.



