The benefits don't really have much to do with ARM specifically; it's more that Apple can tune its silicon designs for its own narrow set of applications, while ARM Holdings and other silicon IP designers have to design much more generally, since they sell to a broad customer base integrating the cores into everything from smartphones to edge switches.
I'm not sure to what extent they've done so, but with their own silicon Apple has the freedom to simply not implement the unnecessary optional bits of ARM, like 32-bit support, and the optional ISA extensions that aren't applicable to desktop-class general-purpose compute.
The saved transistor budget could let them shave precious time and energy off performance-critical paths like the instruction fetcher/decoder.
No idea what they're actually doing (they're not exactly open about their designs), but with their own processors they're free to optimize for a specific use case and a specific kernel. A concrete example: they've reduced the time spent in garbage collection by a factor of 3-5, IIRC, which has pretty dramatic ramifications for both performance and memory usage (since you can do GC more quickly and more often).
I counted ARC as a sort of GC, though I've heard it argued both that it is and that it isn't one, and I see both sides' points.
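To illustrate the "is ARC a GC or not" ambiguity, here's a sketch in Python (chosen because CPython happens to use the same hybrid scheme: deterministic reference counting for most objects, like Swift's ARC, plus a tracing collector only for reference cycles). The `Node` class and the specific objects here are just hypothetical examples, not anything from Apple's runtime.

```python
# Refcounting (ARC-style) frees objects the instant the last reference
# disappears -- deterministic, no collection pause. But pure refcounting
# can never reclaim reference cycles; that's what a tracing GC is for.
import gc
import weakref

class Node:
    def __init__(self):
        self.next = None

# Acyclic case: refcount hits zero -> freed immediately, ARC-style.
a = Node()
ref_a = weakref.ref(a)
del a                        # last strong reference gone
print(ref_a() is None)       # True: freed deterministically

# Cyclic case: the refcount never reaches zero on its own.
b = Node()
b.next = b                   # self-cycle
ref_b = weakref.ref(b)
del b
print(ref_b() is None)       # False: the cycle keeps it alive...
gc.collect()                 # ...until the tracing collector runs
print(ref_b() is None)       # True
```

So whether you call ARC "a GC" mostly comes down to whether you count deterministic refcounting as collection, or reserve the term for the tracing part.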
I love that article and the research behind it, but it's worth pointing out that the details were sort of "reverse engineered" by probing XNU and running diagnostic programs, not taken from Apple-supplied docs. I doubt Apple will ever directly document it the way IBM or Intel document their CPUs, though I hope I'm wrong!
It's a great piece and, as you say, it's a shame that we won't see more detailed info from Apple, but I've seen a couple of interesting comments (leaks?) from Apple employees with more details (e.g. the tweets embedded in the article below). I suspect we'll know an awful lot about the M1 by the time we've finished!