Very interesting read. The author notes that double-pumping 512-bit instructions through 256-bit execution units appears to be a good trade-off.
As far as I understand, ARM's new SIMD instruction set is able to map to execution units of arbitrary width. So it sounds to me like ARM is ahead of x86 in flexibility here and might be able to profit from that in the future.
Maybe somebody with more in-depth knowledge could confirm whether my understanding is correct.
With any traditional ISA with wide registers and instructions, a.k.a. SIMD instructions, it is possible to implement the execution units with any desired width, regardless of the architectural register and instruction width.
Obviously, it only makes sense for the width of the execution units to be a divisor of the architectural width; otherwise they would not be used efficiently.
Thus it is possible to choose various compromises between the cost and the performance of the execution units.
However, if the ISA specifies e.g. 32 512-bit registers, then even the cheapest implementation must include at least that many physical registers, even if the execution units may be much narrower.
What is new in ARM SVE/SVE2, and what gives the "Scalable" name to that vector extension, is that the register width is not fixed by the ISA but may differ between implementations.
Thus a cheap smartphone CPU may have 128-bit registers, while an expensive server CPU for scientific computation applications might have 1024-bit registers.
With SVE/SVE2, it is possible to write a program without knowing which will be the width of the registers on the target CPU.
Nevertheless, the scalability feature is not perfect: some programs can still be made faster if a certain register width is assumed at compile time, which in turn may make them run slower than possible on a CPU that in fact has wider registers than assumed.
ARM's SVE is definitely interesting, but I do wonder if it is slowly homing in on Cray-style vector processing. Which is definitely a cool idea, but a little different from the now-popular fixed-width SIMD. I don't know that it makes sense to call one ahead of the other yet -- ARM's documentation is clear that SVE2 doesn't replace NEON. "Mostly scalar but let's sprinkle in some SIMD" coding will probably always be with us (until ML somehow turns all programs into dot products, I guess!)
RISC-V also has a variable length vector extension.
There's not really any reason for modern general-purpose CPUs to specialize for IPC lower than 1 like what Cray did. CPUs need wide frontends to execute existing scalar code as fast as we're used to, and if you're not reusing most of that width for vectors then the design is just wasting power.
The key difference is that on SVE, the hardware specifies the vector width, whilst with AVX, the programmer specifies the vector width.
Of course, the actual hardware EU doesn't have to match either of these.
There are various benefits and drawbacks to each approach.