> Overall the new A12 Vortex cores and the architectural improvements on the SoC...

aw1621107 · on Oct 5, 2018

Wow indeed. I'm impressed that Apple has managed to create and maintain such an insane lead in ARM performance for such a long period of time.

Does anyone know of more technical reasons for Apple's ARM processors outperforming everyone else's by such a large margin, and for such a long period of time? Seems like there's some fundamental difference in what Apple is doing, and I'd love to read more about it.

GeekyBear · on Oct 5, 2018

One factor called out in the article is the number of instructions that can be carried out simultaneously during a single clock cycle.

>Monsoon (A11) and Vortex (A12) are extremely wide machines – with 6 integer execution pipelines among which two are complex units, two load units and store units, two branch ports, and three FP/vector pipelines this gives an estimated 13 execution ports, far wider than Arm’s upcoming Cortex A76 and also wider than Samsung’s M3. In fact, assuming we're not looking at an atypical shared port situation, Apple’s microarchitecture seems to far surpass anything else in terms of width, including desktop CPUs.

https://www.anandtech.com/show/13392/the-iphone-xs-xs-max-re...

Apple first moved to wide CPU designs with the Cyclone CPU core found in the A7 SOC first used in the iPhone 5s.

>With Cyclone Apple is in a completely different league. As far as I can tell, peak issue width of Cyclone is 6 instructions. That’s at least 2x the width of Swift and Krait, and at best more than 3x the width depending on instruction mix. Limitations on co-issuing FP and integer math have also been lifted as you can run up to four integer adds and two FP adds in parallel. You can also perform up to two loads or stores per clock.

https://www.anandtech.com/show/7460/apple-ipad-air-review/2

rayiner · on Oct 5, 2018

That's part of it, but doesn't fully answer the question. Decoding and issuing 6 instructions per cycle ordinarily is extremely costly in terms of power. And it's usually very hard to keep those execution units busy--it's hard to find six independent instructions to issue every clock cycle. How Apple built a 6-wide CPU within that power envelope, and optimized the compiler to actually use that IPC is the really interesting question.

On a Xeon v3 core, SPECint averages below 2 instructions per cycle: https://www.researchgate.net/publication/322745869_A_Workloa.... How does Apple beat Intel on branch-heavy integer benchmarks like 403.gcc by a factor of two per clock?
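To make the "hard to find six independent instructions" point concrete, here is a minimal sketch of a toy issue model. It assumes unit latency, an unlimited scheduling window, and a made-up dependency-list format; none of this models a real core, it only shows how dependencies cap IPC regardless of width.

```python
def cycles_to_issue(deps, width):
    """Count cycles needed to issue every instruction.

    deps[i] is the set of EARLIER instruction indices whose results
    instruction i needs. Assumptions (not from any real CPU): every
    instruction has 1-cycle latency and the scheduler can look at all
    pending instructions at once.
    """
    done_cycle = {}                      # instruction index -> cycle it issued
    remaining = list(range(len(deps)))
    cycle = 0
    while remaining:
        cycle += 1
        issued = []
        for i in remaining:
            if len(issued) == width:
                break
            # Ready only if every producer issued in an earlier cycle.
            if all(d in done_cycle and done_cycle[d] < cycle for d in deps[i]):
                issued.append(i)
        for i in issued:
            done_cycle[i] = cycle
        remaining = [i for i in remaining if i not in done_cycle]
    return cycle


def ipc(deps, width):
    """Average instructions issued per cycle for this toy model."""
    return len(deps) / cycles_to_issue(deps, width)


# Twelve instructions where each one needs the previous result.
chain = [set()] + [{i - 1} for i in range(1, 12)]
# Twelve instructions with no dependencies at all.
independent = [set() for _ in range(12)]

print(ipc(chain, 6))        # 1.0: width does not help a serial chain
print(ipc(independent, 6))  # 6.0: only fully independent work fills the machine
```

Real code sits somewhere between these two extremes, which is why a sustained IPC below 2 on SPECint is not surprising even on a wide core.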

Symmetry · on Oct 5, 2018

Lower maximum clock speeds mean you have more FO4s to play with and potentially make the fan-out issues in wide designs a bit more manageable. Decode I expect to be pretty easy: as long as your branch predictor is on target, the power costs just grow linearly with decode width when all your instructions start at nice 32-bit boundaries.

Mostly I'm curious about how complete the bypass network is on their functional units and if execution is clustered like the POWER8. The width doubling in the A series does remind me of the POWER7 to POWER8 transition.

Renaming is also apparently a major constraint on design width in many cases but I'm not so familiar with that.

aw1621107 · on Oct 5, 2018

> more FO4s to play with

What does "FO4" stand for here? Googling it yields "Fallout 4", which definitely isn't right, and I'm not sure what other keywords to tack on to get the right result.

Symmetry · on Oct 5, 2018

Sorry, that's "fan-out of 4". Traditionally you look at circuit timings in terms of how long it takes one gate (canonically an inverter) to switch 4 copies of itself. Wire capacitance is a lot more important these days so it's not necessarily the best metric anymore, but it's still used. The fewer FO4s of delay in your longest pipeline stage, the faster you can clock your chip, though there's also a non-linear dependence on voltage. Because of that non-linearity I'd still expect a lower-clocked chip to have more complex pipeline stages. And you can only increase your speed by slimming down stages so much: the overhead of latches and accounting for clock jitter generally adds 4 FO4s beyond the useful logic you accomplish in a pipeline stage.
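A rough sketch of that arithmetic, for anyone who wants to play with it. The 4 FO4 overhead is the rule of thumb from the comment above; the picosecond value of one FO4 is process- and voltage-dependent, so the `fo4_delay_ps` passed in below is purely an assumed number, and the function names are made up for illustration.

```python
def max_clock_ghz(logic_fo4, fo4_delay_ps, overhead_fo4=4.0):
    """Clock frequency when the slowest stage holds `logic_fo4` of useful logic.

    overhead_fo4: latch + clock-jitter overhead per stage (rule of thumb).
    fo4_delay_ps: delay of a single FO4 in picoseconds (assumed input).
    """
    cycle_ps = (logic_fo4 + overhead_fo4) * fo4_delay_ps
    return 1000.0 / cycle_ps  # a 1000 ps period corresponds to 1 GHz


def useful_fraction(logic_fo4, overhead_fo4=4.0):
    """Share of each clock period spent on real logic rather than overhead."""
    return logic_fo4 / (logic_fo4 + overhead_fo4)


# Hypothetical process where one FO4 takes 10 ps.
print(max_clock_ghz(16, 10.0), useful_fraction(16))  # 5.0 GHz, 0.8 useful
print(max_clock_ghz(36, 10.0), useful_fraction(36))  # 2.5 GHz, 0.9 useful
```

The second line is the "more FO4s to play with" case: half the clock, but each stage gets more than twice the useful logic, and a smaller share of every cycle is lost to overhead.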

aw1621107 · on Oct 5, 2018

Excellent explanation! Quick follow-up question: Why FO4 specifically, and not some other number/metric? Was (is?) that a particularly common structure in CPUs?

Symmetry · on Oct 5, 2018

I'm afraid you'll have to ask someone with a much grayer beard than mine to get a good answer to that.

Cyph0n · on Oct 5, 2018

As someone who is familiar with the term, great explanation!

mikeyouse · on Oct 5, 2018

From this rando PDF, it looks like "Fan-out of 4".

> Fan-out of 4 is a process-independent delay metric in CMOS tech.

The wrap-up slide at the very end has a bit of a rundown.

http://people.duke.edu/~bcl15/teachdir/ece590_fall14/Present...

amluto · on Oct 5, 2018

One respect in which I can imagine the x86 ISA being a real problem is in decode bandwidth. To issue 6 x86 instructions per cycle, either the front end needs to decode 6 per cycle, or it needs to cache decoded instructions. And x86 can’t be decoded in parallel without massive complexity because the instructions are variable length, and even determining the length requires mostly decoding the instruction.
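A small sketch of why the two cases differ. The variable-length encoding here is a hypothetical toy (length stored in the low two bits of the first byte), far simpler than real x86, and the function names are invented for illustration.

```python
def fixed_width_starts(code, width=4):
    """Instruction start offsets for a fixed-width ISA.

    Each offset is i * width, so every decoder lane can compute its own
    start without knowing anything about the other instructions.
    """
    return [i * width for i in range(len(code) // width)]


def variable_length_starts(code, length_of):
    """Instruction start offsets for a variable-length ISA.

    Each start depends on the lengths of all earlier instructions, so
    this walk is serial unless hardware speculatively decodes at many
    byte offsets at once.
    """
    starts = []
    offset = 0
    while offset < len(code):
        starts.append(offset)
        offset += length_of(code, offset)
    return starts


def toy_length(code, offset):
    # Hypothetical encoding: low 2 bits of the first byte give length 1..4.
    # Real x86 length decoding also involves prefixes, ModRM, SIB, etc.
    return (code[offset] & 0b11) + 1


print(fixed_width_starts(bytes(12)))  # [0, 4, 8]
blob = bytes([0x02, 0, 0, 0x00, 0x03, 0, 0, 0, 0x01, 0])
print(variable_length_starts(blob, toy_length))  # [0, 3, 4, 8]
```

Even in this toy, offset 8 cannot be known until the instructions at 0, 3, and 4 have had their lengths worked out.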

rayiner · on Oct 5, 2018

It's true that decoding x86 is harder, but Sandy Bridge+ get most instructions from a uop cache, which delivers 4 fixed-length uops per cycle. You could make that 6 wide, but Intel doesn't because they wouldn't be able to fill that.

pcwalton · on Oct 5, 2018

AArch64 has a larger register file and fewer dependencies in general than x86-64 does. For example, most instructions don't set flags. I don't know for sure, but that might be enough to raise the ILP sufficiently.

AceJohnny2 · on Oct 5, 2018

> and optimized the compiler to actually use that IPC is the really interesting question.

You can check that part at least. Isn't it LLVM?

wintercharm · on Oct 5, 2018

They also have a 192 instruction reorder buffer, and a really well optimized OS.

baybal2 · on Oct 5, 2018

NetBurst... ?

stephencanon · on Oct 5, 2018

NetBurst was 4-wide with (IIRC) two of those ports double-pumped for certain operations, but limited to 4/cycle retire, I think.

rayiner · on Oct 5, 2018

NetBurst (like the P6 architecture before it) was 3-way decode/3-way retire. (Actually, NetBurst could decode just one x86 op per cycle into the trace cache; the trace cache could deliver 3 uops per cycle if there was a cache hit.)

mediocrejoker · on Oct 5, 2018

Just speculating but surely it must have something to do with the entire stack being designed under one roof, no? Having the kernel devs be able to walk across the campus and ask the hardware guys what a register is for must speed up development immensely.

aw1621107 · on Oct 5, 2018

I'm not sure that would explain plain CPU performance as shown here. I think SPEC compiles down to native code, so there should be only a fairly limited part of the whole stack involved here --- compiler guys for generating good code, kernel guys for scheduling/power management, SoC team for CPU/memory/GPU subsystem implementation.

I also thought kernel devs would be working against processor ISAs, not hardware-specific details beyond the ISA.

Symmetry · on Oct 5, 2018

To some extent. I seem to recall people saying that by restricting the range of page sizes they could make the L1 cache virtually indexed but physically tagged instead of physically indexed and tagged as Android phone processors are. That means you can start the lookup before the address translation is complete but still avoid aliasing problems.
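The usual condition behind this, as a minimal sketch: a virtually indexed, physically tagged cache avoids aliasing when the bytes covered by one way fit inside a page, because then the index bits are the same in the virtual and physical address. The cache and page sizes below are illustrative inputs, not claims about any particular chip.

```python
def vipt_alias_free(cache_bytes, ways, page_bytes):
    """True if the set index uses only page-offset bits.

    One way covers cache_bytes / ways bytes. If that span fits in a page,
    the index is unaffected by translation, so the lookup can start
    before the TLB finishes without two virtual aliases landing in
    different sets.
    """
    return cache_bytes // ways <= page_bytes


def max_alias_free_cache(ways, min_page_bytes):
    """Largest cache meeting the condition for the smallest supported page."""
    return ways * min_page_bytes


KiB = 1024
print(vipt_alias_free(32 * KiB, 8, 4 * KiB))    # True
print(vipt_alias_free(128 * KiB, 8, 4 * KiB))   # False
print(vipt_alias_free(128 * KiB, 8, 16 * KiB))  # True
print(max_alias_free_cache(8, 16 * KiB) // KiB) # 128
```

So raising the minimum page size is one way to grow an L1 without adding ways or extra alias handling, which fits the point about restricting the range of page sizes.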

EDIT: But I think another part of it is just being willing to throw more transistors at the problem than Android phone SoC manufacturers are and also that their higher income makes spending more on engineering make sense.

pertymcpert · on Oct 5, 2018

Once the CPU reaches peak frequency the OS shouldn't be involved for CPU benchmarks. So it's more down to hardware.

ken · on Oct 5, 2018

I can see how the compiler and stdlib (aren't things like memcpy implemented in hand-tuned assembly?) could be highly relevant, though.

pertymcpert · on Oct 5, 2018

They are, but that’s really low hanging fruit. If Android for example doesn’t use optimised memcpy implementations then they don’t deserve to exist as a serious OS.

astrange · on Oct 6, 2018

That reminds me, when I read fast code in the 90s-2000s all the asm hackers were into writing their own cool memcpy. Were they just showing off, or did Windows actually never optimize their standard library?

People still seem to like writing their own cool malloc, but memcpy not so much.

pertymcpert · on Oct 10, 2018

There are some cases where it may make sense to write your own implementation, if you have a niche microarchitecture that has unusual performance characteristics that the OS doesn't provide optimized routines for by default. But for most u-archs the default optimized routines should do a good job.

Things like malloc are quite a bit more complicated and more workload-dependent, so there's still some opportunity for specializing an implementation there.

oflannabhra · on Oct 5, 2018

The article gets into some technical aspects, but focuses specifically on A11->A12 changes.

Mostly, A-series chips have enormous L1 cache, great cache hierarchy and management, and very low memory latency. A12 specifically seems to have included an almost total redesign of the cache hierarchy.

I'm sure there are many more reasons their designs significantly outperform competitors, but I've not seen any more publicly available analysis.

aw1621107 · on Oct 5, 2018

> Mostly, A-series chips have enormous L1 cache, great cache hierarchy and management, and very low memory latency.

Has this always been the case compared to contemporary Qualcomm/Exynos/etc. SoCs? Not implying your statement is wrong here; all I know is that the A-series chips have had a big performance advantage for a while now and I haven't read more detailed analyses in the past that may have given hints as to why.

Also, how difficult is cache hierarchy/management to get right? For something as fundamental to good performance these days as cache, I would have expected the major players to be on more or less the same playing field.

baybal2 · on Oct 5, 2018

They are, but SRAM takes space, and thus money.

Apple optimises chip performance

Others optimize chip size

aw1621107 · on Oct 5, 2018

So if one of the other SoC manufacturers decided to match Apple's cache sizes, they should get similar performance? Or is that oversimplifying?

nicoburns · on Oct 5, 2018

I think part of it is that Apple acquired a very smart team of chip designers. That, and the bigger budget they probably give them, likely explains most of the difference. Also, Apple have pursued 2 (very) fast cores whereas other chips often have 4.

Of course, this iteration Apple have beaten almost everyone else to 7nm, so the difference is much more dramatic.

baybal2 · on Oct 6, 2018

>I think part of it is that Apple aquired a very smart team of chip designers.

Yes, P.A. Semi was the place where the "last of the Mohicans" of the US chip industry were.

>Also, apple have pursued 2 (very) fast cores whereas other chips often have 4.

Yes, because people in app development are like web developers, and the word "mutex" gives most of them a panic attack.

Android-style Java should've been more multithreading-friendly, but that does nothing about people not utilising the extra cores.

> this iteration apple have beaten almost everyone else to 7nm, so the difference is much more dramatic.

Yes. I remember how MediaTek beat Apple to 10nm thanks to being a Taiwanese company, but nevertheless "ruined it all" with the Helio X30 being designed with more marketing considerations than engineering ones. Their marketing guys couldn't wait to announce "hey we have 2 more cores than you Qualcomm!"

gniv · on Oct 5, 2018

> Has this always been the case compared to contemporary Qualcomm/Exynos/etc. SoCs?

Yes, that's my understanding. I've seen this discussion last year and the year before. And every time the answer seems to be caches. Cache memory is expensive. Apple seems willing to pay more for the SoC in order to have an overall experience that lets them get away with the high prices.

From what I read, Qualcomm would not be able to sell at volume an equivalently performant SoC.

aninteger · on Oct 5, 2018

Could it also be that a lot more of the applications are written and compiled for the native machine, versus the overhead of a VM (even if that is JIT)? I'd guess Minecraft Pocket Edition would perform similarly on both systems.

aw1621107 · on Oct 5, 2018

Even if more things are being compiled to native code instead of Dalvik, I don't think that would explain the benchmark results here, as I think SPEC is always compiled to native code. It seems there's something more fundamental to Apple's hardware that is allowing for such insane performance.

As for Minecraft Pocket Edition, isn't that written in native code anyways? So I'd expect it to perform better on Apple hardware, assuming the hardware is actually the bottleneck for performance.

avar · on Oct 5, 2018

Dalvik hasn't been used on Android for more than 4 years (since Android 5): https://en.wikipedia.org/wiki/Dalvik_(software)

aw1621107 · on Oct 5, 2018

Looks like things are still compiled to Dalvik bytecode for distribution, but get recompiled to native code upon installation. I didn't know that; I don't follow Android closely, so the particulars of the runtime aren't something I'm familiar with. Still, TIL. Thanks!

monocasa · on Oct 5, 2018

They added another JIT too. So first pass is the AOT installation binary, but it'll JIT traces too.

tyingq · on Oct 5, 2018

Some of it is likely just the high volume sales of a limited number of expensive phone models. They have the money to spend.

Android manufacturers have more competition, and have to address the low end of the market too. Their money, and attention, is spread in a wider swath.

sillyquiet · on Oct 5, 2018

I dunno if I buy this. Sure the spread of Android devices and manufacturers is wider across the price and cost spectrum, but there are 'high-end' Android phone manufacturers as well.

Samsung comes to mind. Surely it's big enough to produce high-performance phones that can compete architecturally with Apple's SoC as well as addressing the developing nation / low-cost phone market?

babypuncher · on Oct 5, 2018

Right, but while Samsung's hardware team is focused on designing silicon for low, mid, and high-end devices throughout the year, Apple is building one chipset for two or three high-end models every year. Apple just shoves last year's models further down the price spectrum rather than launching new low and mid-range devices every year.

The tighter focus, combined with the fact that Apple rakes in more cash to spend on R&D, is why their engineering team is able to win out here.

tyingq · on Oct 5, 2018

"But there are 'high-end' Android phone manufacturer as well"

Sure, yes, but not at a volume that allows them to create a processor that competes with iPhone only on their flagship model. Also, they can't charge $999 for their flagship. The average sales price for an iPhone is higher than the flagship model at Samsung.

chipotle_coyote · on Oct 5, 2018

Wouldn't you consider the flagship model at Samsung the Galaxy Note 9? Which starts at... $999?

scarface74 · on Oct 5, 2018

Even Samsung sells mostly low-end phones. The average selling price of their phones is still under $300.

https://www.gsmarena.com/analysts_average_selling_price_of_a...

From everything I can tell and looking at the numbers, the high end Android market is minuscule.

_dp9d · on Oct 5, 2018

Given Apple can get so far ahead of the competition with ARM for small devices... I still think it's entirely possible they're going to make their own CPUs for desktop Macs, and that is the massive hold up on the "New Mac Pro 2019(ish)"

ken · on Oct 5, 2018

The last time they changed architectures, they stuffed a new motherboard in an old case, to let developers get ready for the change.

I think it's unlikely they'll update their flagship Mac to a new CPU architecture without notifying developers first. On day one, all existing apps would perform poorly, and that's not how you promote a new top-of-the-line system.

tpetry · on Oct 5, 2018

Unless they do it like their transition to Intel and emulate all software written for the old architecture. But their new architecture would have to be very much better than Intel's. Or maybe they transpile x86/64 to their ARM architecture.

_dp9d · on Oct 6, 2018

Is there anything that stops them making their own x86-64 CPU?

astrange · on Oct 6, 2018

Intel and AMD's patent pool would have to expire for anyone else to jump in.

frou_dh · on Oct 5, 2018

They're looking to win back workstation users, not alienate them further with a swathe of software compatibility issues.

et2o · on Oct 6, 2018

I have not seen much evidence for this idea.

Someone · on Oct 5, 2018

They also write:

”This also gives us a great piece of context for Samsung’s M3 core, which was released this year […] Here the Exynos 9810 uses twice the energy over last year’s A11 – at a 55% performance deficit.”

Doesn’t that mean the A11 already is three times as efficient as recent Android cores? (The Samsung M3 was in January’s Hot Chips)
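A quick sketch for checking that arithmetic. It rests on two readings that the quote itself doesn't pin down: that "energy" means total joules for the same fixed workload, and that a "55% performance deficit" means the Exynos scores 0.45x the A11. The function and key names are made up for illustration.

```python
def compare(energy_ratio, perf_ratio):
    """Compare chip B against baseline chip A on the same workload.

    energy_ratio: B's energy / A's energy (joules for identical work).
    perf_ratio:   B's performance / A's performance.
    All returned values are B relative to A.
    """
    time_ratio = 1.0 / perf_ratio  # B takes this many times longer
    return {
        "energy_per_task": energy_ratio,
        "runtime": time_ratio,
        "average_power": energy_ratio / time_ratio,  # power = energy / time
        "energy_delay_product": energy_ratio * time_ratio,
    }


# Reading 1: Exynos scores 0.45x the A11.
print(compare(2.0, 0.45))
# Reading 2: the A11 is 55% faster, i.e. Exynos scores 1/1.55 of the A11.
print(compare(2.0, 1 / 1.55))
```

Under the first reading the Exynos draws about 0.9x the average power, uses 2x the energy per task, and has roughly a 4.4x worse energy-delay product. Under the second reading the energy-delay product comes out at about 3.1x, which is the closest match to "three times". In pure work-per-joule terms the gap is 2x either way.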