> Overall the new A12 Vortex cores and the architectural improvements on the SoC’s memory subsystem give Apple’s new piece of silicon a much higher performance advantage than Apple’s marketing materials promote. The contrast to the best Android SoCs have to offer is extremely stark – both in terms of performance as well as in power efficiency. Apple’s SoCs have better energy efficiency than all recent Android SoCs while having a nearly 2x performance advantage. I wouldn’t be surprised that if we were to normalise for energy used, Apple would have a 3x performance efficiency lead.
Wow indeed. I'm impressed that Apple has managed to create and maintain such an insane lead in ARM performance for such a long period of time.
Does anyone know of more technical reasons for Apple's ARM processors outperforming everyone else's by such a large margin, and for such a long period of time? Seems like there's some fundamental difference in what Apple is doing, and I'd love to read more about it.
One factor called out in the article is the number of instructions that can be carried out simultaneously during a single clock cycle.
>Monsoon (A11) and Vortex (A12) are extremely wide machines – with 6 integer execution pipelines among which two are complex units, two load units and store units, two branch ports, and three FP/vector pipelines this gives an estimated 13 execution ports, far wider than Arm’s upcoming Cortex A76 and also wider than Samsung’s M3. In fact, assuming we're not looking at an atypical shared port situation, Apple’s microarchitecture seems to far surpass anything else in terms of width, including desktop CPUs.
Apple first moved to wide CPU designs with the Cyclone CPU core found in the A7 SOC first used in the iPhone 5s.
>With Cyclone Apple is in a completely different league. As far as I can tell, peak issue width of Cyclone is 6 instructions. That’s at least 2x the width of Swift and Krait, and at best more than 3x the width depending on instruction mix. Limitations on co-issuing FP and integer math have also been lifted as you can run up to four integer adds and two FP adds in parallel. You can also perform up to two loads or stores per clock.
That's part of it, but doesn't fully answer the question. Decoding and issuing 6 instructions per cycle ordinarily is extremely costly in terms of power. And it's usually very hard to keep those execution units busy--it's hard to find six independent instructions to issue every clock cycle. How Apple built a 6-wide CPU within that power envelope, and optimized the compiler to actually use that IPC is the really interesting question.
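To make the "hard to find six independent instructions" point concrete, here is a toy sketch (not a model of any real core). It assumes every instruction has a 1-cycle latency and that the scheduler can look at the whole instruction list; `cycles_to_issue` and the two sample workloads are made up for illustration.

```python
def cycles_to_issue(deps, width):
    """Count cycles needed to issue all instructions on a `width`-wide machine.

    deps[i] lists the earlier instructions whose results instruction i needs.
    Assumption: every instruction completes one cycle after it issues.
    """
    done_cycle = {}                      # instruction index -> cycle it issued
    remaining = list(range(len(deps)))
    cycle = 0
    while remaining:
        cycle += 1
        issued = []
        for i in remaining:
            if len(issued) == width:
                break
            # Ready only if every producer issued in an earlier cycle.
            if all(d in done_cycle and done_cycle[d] < cycle for d in deps[i]):
                issued.append(i)
        for i in issued:
            done_cycle[i] = cycle
        remaining = [i for i in remaining if i not in done_cycle]
    return cycle


# Twelve instructions where each one needs the previous result.
chain = [[]] + [[i - 1] for i in range(1, 12)]
# Twelve instructions with no dependencies at all.
independent = [[] for _ in range(12)]

for name, prog in (("chain", chain), ("independent", independent)):
    for width in (2, 6):
        print(name, "width", width, "->", cycles_to_issue(prog, width), "cycles")
```

On the dependent chain the 6-wide machine is no faster than the 2-wide one; the extra width only pays off when the code (or the out-of-order window) exposes enough independent work.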
Lower maximum clock speeds mean you have more FO4s to play with and potentially makes the fan-out issues in wide designs a bit more manageable. Decode I expect to be pretty easy, as long as your branch predictor is on target the power costs just grow linearly with decode width when all your instructions start at nice 32-bit boundaries.
Mostly I'm curious about how complete the bypass network is on their functional units and if execution is clustered like the POWER8. The width doubling in the A series does remind me of the POWER 7 to 8 transition.
Renaming is also apparently a major constraint on design width in many cases but I'm not so familiar with that.
What does "FO4" stand for here? Googling it yields "Fallout 4", which definitely isn't right, and I'm not sure what other keywords to tack on to get the right result.
Sorry, that's "fan-out of 4". Traditionally you look at circuit timings in terms of how long it takes one transistor to switch 4 other transistors of equal size. Wire capacitance is a lot more important these days so it's not necessarily the best metric anymore but it's still used. The fewer FO4s of delay in your longest pipeline stage the faster you can clock your chip, though there's also a non-linear dependence on voltage. Because of that non-linearity I'd still expect a lower-clocked chip to have more complex pipeline stages. And you can only increase your speed by slimming down stages so much: the overhead of latches and accounting for clock jitter generally adds 4 FO4s beyond the useful logic you accomplish in a pipeline stage.
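A rough back-of-the-envelope sketch of that arithmetic, in case it helps. The 4 FO4 overhead comes from the explanation above; the 10 ps FO4 delay is a placeholder I picked for round numbers, not a figure for any real process, and the voltage dependence is ignored entirely.

```python
def max_clock_ghz(logic_fo4, overhead_fo4=4.0, fo4_delay_ps=10.0):
    """Clock limit set by the slowest pipeline stage.

    logic_fo4    : FO4 delays of useful logic in the longest stage
    overhead_fo4 : latch + clock jitter overhead (about 4 FO4, per the comment above)
    fo4_delay_ps : assumed delay of one FO4 in picoseconds (placeholder value)
    """
    stage_delay_ps = (logic_fo4 + overhead_fo4) * fo4_delay_ps
    return 1000.0 / stage_delay_ps       # 1000 ps per ns, so this is GHz


def useful_fraction(logic_fo4, overhead_fo4=4.0):
    """Share of each cycle spent on real logic rather than overhead."""
    return logic_fo4 / (logic_fo4 + overhead_fo4)


for logic in (36, 16, 6):
    print(f"{logic:>2} FO4 of logic: {max_clock_ghz(logic):.2f} GHz, "
          f"{useful_fraction(logic):.0%} of the cycle is useful work")
```

With these assumed numbers, cutting the logic per stage from 36 to 16 FO4 doubles the clock, but by 6 FO4 of logic 40% of every cycle is overhead, which is the diminishing return described above.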
Excellent explanation! Quick follow-up question: Why FO4 specifically, and not some other number/metric? Was (is?) that a particularly common structure in CPUs?
One respect in which I can imagine the x86 ISA being a real problem is in decode bandwidth. To issue 6 x86 instructions per cycle, either the front end needs to decode 6 per cycle, or it needs to cache decoded instructions. And x86 can’t be decoded in parallel without massive complexity because the instructions are variable length, and even determining the length requires mostly decoding the instruction.
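A small sketch of why the boundaries are the problem. The variable-length encoding here is invented for illustration (the low two bits of the first byte give the length); real x86 length decoding is far more involved, which only makes the serial dependency worse.

```python
def fixed_width_starts(code, width=4):
    """Fixed-width ISA: every instruction start is known without looking at
    the bytes, so each slot could be handed to its own decoder in parallel."""
    return list(range(0, len(code), width))


def toy_length(code, pc):
    """Hypothetical encoding: (low 2 bits of the first byte) + 1 = length, 1..4 bytes."""
    return (code[pc] & 0b11) + 1


def variable_length_starts(code, length_of=toy_length):
    """Variable-length ISA: the start of instruction N+1 is unknown until
    instruction N has been at least partially decoded, a serial chain."""
    starts, pc = [], 0
    while pc < len(code):
        starts.append(pc)
        pc += length_of(code, pc)
    return starts


fixed = bytes(16)
variable = bytes([0x00, 0x02, 0xAA, 0xBB, 0x01, 0xCC, 0x03, 0x01, 0x02, 0x03])

print(fixed_width_starts(fixed))         # [0, 4, 8, 12]
print(variable_length_starts(variable))  # [0, 1, 4, 6]
```

Hardware can work around the serial loop (for example by speculatively decoding at many byte offsets, or by caching decoded instructions), but that is the extra complexity and power being referred to.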
It's true that decoding x86 is harder, but Sandy Bridge+ get most instructions from a uop cache, which delivers 4 fixed-length uops per cycle. You could make that 6 wide, but Intel doesn't because they wouldn't be able to fill that.
AArch64 has a larger register file and fewer dependencies in general than x86-64 does. For example, most instructions don't set flags. I don't know for sure, but that might be enough to raise the ILP sufficiently.
NetBurst (like the P6 architecture before it) was 3-way decode/3-way retire. (Actually, NetBurst could decode just one x86 op per cycle into the trace cache; the trace cache could deliver 3 uops per cycle if there was a cache hit.)
Just speculating but surely it must have something to do with the entire stack being designed under one roof, no? Having the kernel devs be able to walk across the campus and ask the hardware guys what a register is for must speed up development immensely.
I'm not sure that would explain plain CPU performance as shown here. I think SPEC compiles down to native code, so there should be only a fairly limited part of the whole stack involved here --- compiler guys for generating good code, kernel guys for scheduling/power management, SoC team for CPU/memory/GPU subsystem implementation.
I also thought kernel devs would be working against processor ISAs, not hardware-specific details beyond the ISA.
To some extent. I seem to recall people saying that by restricting the range of page sizes they could make the L1 cache virtually indexed but physically tagged instead of physically indexed and tagged as Android phone processors are. That means you can start the lookup before the address translation is complete but still avoid aliasing problems.
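The usual rule of thumb behind that: a virtually indexed, physically tagged cache avoids aliasing when the index and line-offset bits all fall inside the page offset, i.e. when cache size divided by associativity is no larger than the smallest page. A minimal sketch, with sizes chosen purely as examples and not claimed for any particular chip:

```python
KIB = 1024


def is_alias_free_vipt(cache_size, ways, page_size):
    """True if one way of the cache fits within a single page, so the index
    bits come entirely from the untranslated page-offset bits."""
    return cache_size // ways <= page_size


def max_alias_free_vipt_bytes(page_size, ways):
    """Largest cache that satisfies the rule for a given page size and associativity."""
    return page_size * ways


# Illustrative numbers only: the same 64 KiB, 4-way cache under two minimum page sizes.
print(is_alias_free_vipt(64 * KIB, 4, 4 * KIB))    # False: 16 KiB per way > 4 KiB page
print(is_alias_free_vipt(64 * KIB, 4, 16 * KIB))   # True: 16 KiB per way fits a 16 KiB page
print(max_alias_free_vipt_bytes(16 * KIB, 8) // KIB, "KiB")  # 128 KiB
```

So raising the minimum page size is one way to grow the L1 without adding ways or handling aliases in hardware.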
EDIT: But I think another part of it is just being willing to throw more transistors at the problem than Android phone SoC manufacturers are and also that their higher income makes spending more on engineering make sense.
They are, but that’s really low hanging fruit. If Android for example doesn’t use optimised memcpy implementations then they don’t deserve to exist as a serious OS.
That reminds me, when I read fast code in the 90s-2000s all the asm hackers were into writing their own cool memcpy. Were they just showing off, or did Windows actually never optimize their standard library?
People still seem to like writing their own cool malloc, but memcpy not so much.
There are some cases where it may make sense to write your own implementation, if you have a niche microarchitecture that has unusual performance characteristics that the OS doesn't provide optimized routines for by default. But for most u-archs the default optimized routines should do a good job.
Things like malloc are quite a bit more complicated, and more workload dependent so there's still some opportunity specializing an implementation there.
The article gets into some technical aspects, but focuses specifically on A11->A12 changes.
Mostly, A-series chips have enormous L1 cache, great cache hierarchy and management, and very low memory latency. A12 specifically seems to have included an almost total redesign of the cache hierarchy.
I'm sure there are many more reasons their designs significantly outperform competitors, but I've not seen any more publicly available analysis.
> Mostly, A-series chips have enormous L1 cache, great cache hierarchy and management, and very low memory latency.
Has this always been the case compared to contemporary Qualcomm/Exynos/etc. SoCs? Not implying your statement is wrong here; all I know is that the A-series chips have had a big performance advantage for a while now and I haven't read more detailed analyses in the past that may have given hints as to why.
Also, how difficult is cache hierarchy/management to get right? For something as fundamental to good performance these days as cache, I would have expected the major players to be on more or less the same playing field.
I think part of it is that Apple acquired a very smart team of chip designers. That, plus probably giving them a bigger budget, likely explains most of the difference. Also, Apple have pursued 2 (very) fast cores whereas other chips often have 4.
Of course, this iteration Apple have beaten almost everyone else to 7nm, so the difference is much more dramatic.
>I think part of it is that Apple acquired a very smart team of chip designers.
Yes, P.A. Semi was where the "last of the Mohicans" of the US chip industry were.
>Also, Apple have pursued 2 (very) fast cores whereas other chips often have 4.
Yes, because app developers are much like web developers, and the word "mutex" gives most of them a panic attack.
Android-style Java should have been more multithreading-friendly, but that does nothing if people don't actually use the extra cores.
> this iteration Apple have beaten almost everyone else to 7nm, so the difference is much more dramatic.
Yes. I remember how MediaTek beat Apple to 10nm thanks to being a Taiwanese company, but nevertheless "ruined it all" because their Helio X30 was designed with more marketing considerations than engineering ones. Their marketing guys couldn't wait to announce "hey we have 2 more cores than you, Qualcomm!"
> Has this always been the case compared to contemporary Qualcomm/Exynos/etc. SoCs?
Yes, that's my understanding. I've seen this discussion last year and the year before. And every time the answer seems to be caches. Cache memory is expensive. Apple seems willing to pay more for the SoC in order to have an overall experience that lets them get away with the high prices.
From what I read, Qualcomm would not be able to sell at volume an equivalently performant SoC.
Could it also be that a lot more of the applications are written and compiled to native machine code, versus the overhead of a VM (even one with a JIT)? I'd guess Minecraft Pocket Edition would perform similarly on both systems.
Even if more things are being compiled to native code instead of Dalvik, I don't think that would explain the benchmark results here, as I think SPEC is always compiled to native code. It seems there's something more fundamental to Apple's hardware that is allowing for such insane performance.
As for Minecraft Pocket Edition, isn't that written in native code anyways? So I'd expect it to perform better on Apple hardware, assuming the hardware is actually the bottleneck for performance.
Looks like things are still compiled to Dalvik bytecode for distribution, but get recompiled to native code upon installation. I didn't know that; I don't follow Android closely, so the particulars of the runtime aren't something I'm familiar with. Still, TIL. Thanks!
Some of it is likely just the high volume sales of a limited number of expensive phone models. They have the money to spend.
Android manufacturers have more competition, and have to address the low end of the market too. Their money, and attention, is spread in a wider swath.
I dunno if I buy this.
Sure, the spread of Android devices and manufacturers is wider across the price and cost spectrum, but there are 'high-end' Android phone manufacturers as well.
Samsung comes to mind, surely it's big enough to produce high-performance phones that can compete architecturally with Apple's SoC as well as addressing the developing nation / low-cost phone market?
Right, but while Samsung's hardware team is focused on designing silicon for low, mid, and high-end devices throughout the year, Apple is building one chipset for two or three high-end models every year. Apple just shoves last year's models further down the price spectrum rather than launching new low and mid-range devices every year.
The tighter focus, combined with the fact that Apple rakes in more cash to spend on R&D, is why their engineering team is able to win out here.
"But there are 'high-end' Android phone manufacturers as well"
Sure, yes, but not at a volume that allows them to create a processor that competes with iPhone only on their flagship model. Also, they can't charge $999 for their flagship. The average sales price for an iPhone is higher than the flagship model at Samsung.
Given Apple can get so far ahead of the competition with ARM for small devices... I still think it's entirely possible they're going to make their own CPUs for desktop Macs, and that is the massive hold up on the "New Mac Pro 2019(ish)"
The last time they changed architectures, they stuffed a new motherboard in an old case, to let developers get ready for the change.
I think it's unlikely they'll update their flagship Mac to a new CPU architecture without notifying developers first. On day one, all existing apps would perform poorly, and that's not how you promote a new top-of-the-line system.
Unless they do it like their transition to Intel and emulate all software built for the old architecture. But their new architecture would have to be very much better than Intel's. Or maybe they transpile x86/64 to their ARM architecture.
”This also gives us a great piece of context for Samsung’s M3 core, which was released this year […] Here the Exynos 9810 uses twice the energy over last year’s A11 – at a 55% performance deficit.”
Doesn’t that mean the A11 already is three times as efficient as recent Android cores? (The Samsung M3 was in January’s Hot Chips)
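Working through the quoted numbers, under one reading of them: "twice the energy" is taken as energy for the same workload, and "55% performance deficit" as the Exynos delivering 45% of the A11's performance. Both readings are my assumptions; the article may define them differently.

```python
def compare(energy_ratio=2.0, perf_deficit=0.55):
    """Relative figures for the A11 vs. the Exynos 9810 under the assumptions above.

    energy_ratio : Exynos energy / A11 energy for the same work
    perf_deficit : fraction by which Exynos performance trails the A11
    """
    exynos_rel_perf = 1.0 - perf_deficit           # Exynos speed as a fraction of A11
    return {
        "a11_speedup": 1.0 / exynos_rel_perf,       # how much faster the A11 is
        "a11_work_per_joule_advantage": energy_ratio,
        # average power = energy / time, and the Exynos takes 1/exynos_rel_perf as long
        "exynos_rel_avg_power": energy_ratio * exynos_rel_perf,
    }


for key, value in compare().items():
    print(f"{key}: {value:.2f}")
```

Under this reading the A11 is roughly 2.2x faster and does about 2x the work per joule, while the two chips draw similar average power. Getting to a 3x efficiency figure would need a different normalisation than the one assumed here.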
Wow.