There is more than one way to scale. Over the last decade, Intel has been pushing wider SIMD.
Instead of making your CPU able to execute more instructions per cycle, why not make each instruction do more work? SSE packs four floats/ints or two doubles/longs into a single 128-bit register, and then the same ALU operation is applied to each lane.
It works great on certain workloads.
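To make this concrete, here's a minimal sketch using the SSE intrinsics (the function names are mine, and n is assumed to be a multiple of 4 so tail handling doesn't clutter the example). The scalar loop issues one add per element; the SSE loop issues one add per four elements:

```c
#include <immintrin.h>  /* SSE intrinsics */

/* Scalar version: one add instruction per element. */
void add_scalar(float *dst, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* SSE version: one add instruction per four elements. */
void add_sse(float *dst, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);            /* load 4 floats into a 128-bit register */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb)); /* one add covers all 4 lanes */
    }
}
```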
With AVX, Intel increased the size of these registers to 256 bits (eight floats) in 2011, and is currently pushing AVX-512, which doubles the width again (16 floats).
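Widening is just the same loop with a bigger stride. A sketch of the 256-bit and 512-bit variants (same divisibility assumption; these need AVX/AVX-512 enabled at compile time, e.g. -mavx or -mavx512f on gcc/clang):

```c
#include <immintrin.h>

/* AVX version: 8 floats per instruction -- half the instructions of SSE. */
void add_avx(float *dst, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
    }
}

/* AVX-512 version: 16 floats per instruction -- a quarter of the SSE count. */
void add_avx512(float *dst, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);
        __m512 vb = _mm512_loadu_ps(b + i);
        _mm512_storeu_ps(dst + i, _mm512_add_ps(va, vb));
    }
}
```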
Apple, and ARM in general, are limited to 128-bit vector registers (though there are plans to increase that in the future).
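The NEON equivalent on ARM looks just like the SSE version, stuck at the same 128-bit width (again a sketch, same assumptions):

```c
#include <arm_neon.h>

/* NEON version: ARM's vector registers are 128 bits wide, so like SSE
   this processes 4 floats per instruction. */
void add_neon(float *dst, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vaddq_f32(va, vb));
    }
}
```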
Cinebench is well known as a benchmark that takes advantage of the 256-bit AVX registers, and some people have speculated that Apple's M1 might be at a significant disadvantage because of this, with just half the ALU throughput.
But these numbers show that while Cinebench gets a notable boost from AVX, it's not as large as you might think (at least on this workload), allowing the M1's IPC advantage to shine through.
A SIMD unit is basically a controller plus ALUs. A wider SIMD unit gives a better calculation-to-controller ratio, and fewer instructions mean less pressure on the entire front-end (decoder, caches, reordering complexity, etc.). This is more efficient overall if fully utilized.
The downsides are that wide units can affect core clock speeds (slowing down non-SIMD code too), programmers must optimize their code to use wider and wider units, and some code simply can't use execution units beyond a certain width.
Since x86 wants to decrease decode pressure at all costs (decode is very expensive on x86), this approach makes a lot of sense to push for. If you're doing math on large matrices, the extra efficiency pays off (this is why AVX-512 was basically left to workstation and HPC chips).
Apple's approach gambles that they can overcome the inefficiencies with higher utilization. Their decode penalty isn't as high, which is the key to their strategy: they have literally twice the decode width of x86 (8-wide vs 4-wide -- things get murky with x86's fused instructions, but I believe those are somewhat less common today).
In that same matrix code, they'll (theoretically) have 4x as many instructions for the same work as AVX-512 (2x vs AVX2), so we'd expect to see the x86 approach pay off here. In more typical consumer applications, code is more likely to use intermittent, short-width vectors. If the full x86 SIMD width can't be used, the rest is just transistors and power wasted (a very likely reason why AMD still hasn't gone wider than AVX2).
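The 4x/2x figures fall straight out of the width ratios. A toy back-of-the-envelope counting only the vector adds for 1024 floats:

```c
#include <stdio.h>

int main(void) {
    /* Instruction counts for adding 1024 floats, counting only the
       vector adds (loads, stores, and loop overhead ignored). */
    int n = 1024;
    printf("NEON/SSE (4 lanes): %d adds\n", n / 4);   /* 256 */
    printf("AVX2     (8 lanes): %d adds\n", n / 8);   /* 128 -> 2x fewer */
    printf("AVX-512 (16 lanes): %d adds\n", n / 16);  /*  64 -> 4x fewer */
    return 0;
}
```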
To keep peak utilization, the M1 has a massive instruction window (a bit less than 2x the size of Intel's and close to 3x the size of AMD's at present). This allows it to look far ahead for SIMD instructions to execute, and it should also help offset the larger total instruction count in SIMD-heavy code.
Now, there's a caveat here with SVE. The Scalable Vector Extension lets the programmer write vector code without hard-coding an execution width; the implementation then has the choice of using a narrower SIMD unit over more cycles or a wider SIMD unit over fewer. The M1 has four floating-point SIMD units that are supposedly identical (except that one has some extra hardware for things like division). They could be allowing these units to gang together into one big SIMD unit when the vector is wide enough to require it. That would be quite a bit closer to the best of both worlds (you still have multiple controllers, but you lose all the extra instruction pressure).
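To illustrate the vector-length-agnostic style (purely hypothetical on Apple hardware -- the M1 ships NEON, not SVE): with the ACLE SVE intrinsics, the loop never names a width, so the same binary keeps every lane busy whether the hardware offers one narrow unit or several ganged into a wide one.

```c
#include <arm_sve.h>
#include <stdint.h>

/* SVE sketch: no vector width appears anywhere in the source. */
void add_sve(float *dst, const float *a, const float *b, int64_t n) {
    for (int64_t i = 0; i < n; i += svcntw()) {   /* svcntw() = floats per vector, set by hardware */
        svbool_t pg = svwhilelt_b32_s64(i, n);    /* predicate masks off the tail */
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        svst1_f32(pg, dst + i, svadd_f32_x(pg, va, vb));
    }
}
```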