There is more than one way to scale. Over the last decade, Intel has been pushing wider SIMD.
Instead of making your CPU able to execute more instructions per cycle, why not make each instruction do more work? SSE packs four floats/ints or two doubles/longs into a single 128-bit register, and then the same ALU operation is applied to each lane.
It works great on certain workloads.
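To make this concrete, here's a minimal sketch using the SSE intrinsics (the function names are mine, and n is assumed to be a multiple of 4 so tail handling doesn't clutter the example). The scalar loop issues one add per element; the SSE loop issues one add per four elements:

```c
#include <immintrin.h>  /* SSE intrinsics */

/* Scalar version: one add instruction per element. */
void add_scalar(float *dst, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* SSE version: one add instruction per four elements. */
void add_sse(float *dst, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);            /* load 4 floats into a 128-bit register */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb)); /* one add covers all 4 lanes */
    }
}
```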
With AVX, Intel increased the size of these registers to 256 bits (eight floats) in 2011, and is currently pushing AVX-512, which doubles the width again (16 floats).
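Widening is just the same loop with a bigger stride. A sketch of the 256-bit and 512-bit variants (same divisibility assumption; these need AVX/AVX-512 enabled at compile time, e.g. -mavx or -mavx512f on gcc/clang):

```c
#include <immintrin.h>

/* AVX version: 8 floats per instruction -- half the instructions of SSE. */
void add_avx(float *dst, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
    }
}

/* AVX-512 version: 16 floats per instruction -- a quarter of the SSE count. */
void add_avx512(float *dst, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);
        __m512 vb = _mm512_loadu_ps(b + i);
        _mm512_storeu_ps(dst + i, _mm512_add_ps(va, vb));
    }
}
```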
Apple, and ARM in general, are limited to 128-bit vector registers (though there are plans to increase that in the future).
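The NEON equivalent on ARM looks just like the SSE version, stuck at the same 128-bit width (again a sketch, same assumptions):

```c
#include <arm_neon.h>

/* NEON version: ARM's vector registers are 128 bits wide, so like SSE
   this processes 4 floats per instruction. */
void add_neon(float *dst, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vaddq_f32(va, vb));
    }
}
```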
Cinebench is well known as a benchmark that takes advantage of the 256-bit AVX registers, and some people have speculated that Apple's M1 might be at a significant disadvantage because of this, with just half the ALU throughput.
But these numbers show that while Cinebench gets a notable boost from AVX, it's not as large as you might think (at least on this workload), allowing the M1's IPC advantage to shine through.
A SIMD unit is basically a controller plus ALUs. A wider SIMD unit gives a better calculation-to-controller ratio, and fewer instructions mean less pressure on the entire front-end (decoder, caches, reordering complexity, etc.). This is more efficient overall if fully utilized.
The downsides are that wide units can affect core clock speeds (slowing down non-SIMD code too), programmers must optimize their code to use wider and wider units, and some code simply can't use execution units beyond a certain width.
Since x86 wants to decrease decode pressure at all costs (decode is very expensive on x86), this approach makes a lot of sense to push for. If you're doing math on large matrices, the extra efficiency pays off (this is why AVX-512 was basically left to workstation and HPC chips).
Apple's approach gambles that they can overcome the inefficiencies with higher utilization. Their decode penalty isn't as high, which is the key to their strategy: they have literally twice the decode width of x86 (8-wide vs 4-wide -- things get murky with x86's fused instructions, but I believe those are somewhat less common today).
In that same matrix code, they'll (theoretically) have 4x as many instructions for the same work as AVX-512 (2x vs AVX2), so we'd expect to see the x86 approach pay off here. In more typical consumer applications, code is more likely to use intermittent, short-width vectors. If the full x86 SIMD width can't be used, the rest is just transistors and power wasted (a very likely reason why AMD still hasn't gone wider than AVX2).
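The 4x/2x figures fall straight out of the width ratios. A toy back-of-the-envelope counting only the vector adds for 1024 floats:

```c
#include <stdio.h>

int main(void) {
    /* Instruction counts for adding 1024 floats, counting only the
       vector adds (loads, stores, and loop overhead ignored). */
    int n = 1024;
    printf("NEON/SSE (4 lanes): %d adds\n", n / 4);   /* 256 */
    printf("AVX2     (8 lanes): %d adds\n", n / 8);   /* 128 -> 2x fewer */
    printf("AVX-512 (16 lanes): %d adds\n", n / 16);  /*  64 -> 4x fewer */
    return 0;
}
```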
To keep peak utilization, the M1 has a massive instruction window (a bit less than 2x the size of Intel's and close to 3x the size of AMD's at present). This allows it to look far ahead for SIMD instructions to execute, and it should also help offset the larger total instruction count in SIMD-heavy code.
Now, there's a caveat here with SVE. The Scalable Vector Extension lets the programmer write vector code without hard-coding an execution width; the implementation then has the choice of using a narrower SIMD unit over more cycles or a wider SIMD unit over fewer. The M1 has four floating-point SIMD units that are supposedly identical (except that one has some extra hardware for things like division). They could be allowing these units to gang together into one big SIMD unit when the vector is wide enough to require it. That would be quite a bit closer to the best of both worlds (you still have multiple controllers, but you lose all the extra instruction pressure).
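To illustrate the vector-length-agnostic style (purely hypothetical on Apple hardware -- the M1 ships NEON, not SVE): with the ACLE SVE intrinsics, the loop never names a width, so the same binary keeps every lane busy whether the hardware offers one narrow unit or several ganged into a wide one.

```c
#include <arm_sve.h>
#include <stdint.h>

/* SVE sketch: no vector width appears anywhere in the source. */
void add_sve(float *dst, const float *a, const float *b, int64_t n) {
    for (int64_t i = 0; i < n; i += svcntw()) {   /* svcntw() = floats per vector, set by hardware */
        svbool_t pg = svwhilelt_b32_s64(i, n);    /* predicate masks off the tail */
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        svst1_f32(pg, dst + i, svadd_f32_x(pg, va, vb));
    }
}
```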