The main point is this: for speed, L1 cache is limited to page size * associativity. This is what allows it to be virtually indexed (so you can do the TLB lookup and cache lookup in parallel) and physically tagged (so you don't need to invalidate the cache on every change to the page tables); but it means the index cannot use the bits that are translated by the page tables.
It's also why instruction and data caches are usually split at the L1 level. Keeping them coherent is a bit more complex, but splitting almost doubles the effective cache size. These days it also allows the icache to store pre-decoded instructions, but that's secondary.
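To make the page size * associativity limit concrete, here is a minimal sketch; the page size, line size, and associativity are typical x86-64 values chosen for illustration, not numbers from the comment above.

```c
/* Minimal sketch of the VIPT size limit, using typical x86-64 numbers. */
#include <stdio.h>

int main(void) {
    const unsigned page_size = 4096; /* only the low 12 address bits are untranslated */
    const unsigned line_size = 64;   /* bytes per cache line */
    const unsigned ways      = 8;    /* associativity */

    /* The set index must be built from page-offset bits, since those are the
     * only bits that are identical in the virtual and physical address. */
    const unsigned max_sets = page_size / line_size;        /* 64 sets */
    const unsigned max_size = max_sets * line_size * ways;  /* 32 KiB */

    printf("max VIPT L1 size = %u sets x %u B x %u ways = %u KiB\n",
           max_sets, line_size, ways, max_size / 1024);
    return 0;
}
```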
I'll add a few more reasons to split the L1 icache and dcache:
1) limiting pollution, so memcpy doesn't purge L1 of instructions
2) simplifies interface between CPU and icache - only need to support a handful of instruction fetch block sizes CPU actually uses
TL;DR: Time complexity of memory access scales O(√N), not O(1), where N is your data set size. This applies even without hitting theoretical information density limits, whatever your best process size and latency. For optimal performance on any particular task you'll always need to consider locality of your working data set, i.e. memory hierarchy, caching, etc. And to be clear, the implication is that temporal and spatial locality are fundamentally related.
Yes, some cache levels have lower intrinsic latency than others, but at scale (even at-home, DIY, toy project scales these days) it's simpler to think of that as a consequence of space-time physics, not a property of manufacturing processes, packaging techniques, or algorithms. This is liberating in the sense that you can predict and design against both today's and tomorrow's systems without fixating on specific technology or benchmarks.
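If you want to see that locality effect directly, a pointer-chasing microbenchmark along these lines shows the average access time stepping up as the working set falls out of each cache level; the sizes and iteration counts below are arbitrary choices for illustration, not taken from the comment.

```c
/* Rough pointer-chasing sketch: average latency of dependent loads vs.
 * working-set size. The random cycle defeats prefetching, so the numbers
 * mostly reflect whichever level of the hierarchy the set fits in. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double ns_per_access(size_t n_elems, size_t steps) {
    size_t *next = malloc(n_elems * sizeof *next);
    for (size_t i = 0; i < n_elems; i++) next[i] = i;
    /* Sattolo's algorithm: one random cycle through all elements. */
    for (size_t i = n_elems - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }
    size_t p = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t s = 0; s < steps; s++) p = next[p]; /* each load depends on the last */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    volatile size_t sink = p; (void)sink;           /* keep the loop from being elided */
    free(next);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / steps;
}

int main(void) {
    for (size_t kib = 16; kib <= 64 * 1024; kib *= 4) {
        size_t n = kib * 1024 / sizeof(size_t);
        printf("%6zu KiB working set: %.2f ns/access\n",
               kib, ns_per_access(n, 20u * 1000 * 1000));
    }
    return 0;
}
```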
No, it doesn't. Due to the Bekenstein bound, the amount of matter — and hence information — that can be stored in a sphere is ultimately proportional to surface area, not to volume. This is covered in part 2 of the article: https://www.ilikebigbits.com/2014_04_28_myth_of_ram_2.html
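Spelling out the scaling that follows from that (a quick derivation of my own, not quoted from the article):

```latex
\text{area-bounded storage: } N \propto r^{2}
  \;\Rightarrow\; r \propto \sqrt{N}
  \;\Rightarrow\; t_{\text{access}} \ge \tfrac{2r}{c} \propto \sqrt{N}
\qquad
\text{(volume-bounded packing, } N \propto r^{3}\text{, would give } t \propto \sqrt[3]{N}\text{)}
```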
This is only as you near the theoretical limit of density, though, right? At easy-to-achieve densities that we would be using in the foreseeable future, if we plotted their access times in these different configurations, wouldn't it--practically speaking--still scale at the cube root?
I guess that was covered in the comment.
> This applies even without hitting theoretical information density limits, whatever your best process size and latency.
But now I guess I don't understand why... :(. I feel like we can almost trivially show this isn't true by starting with a downright primitive "best process size"--maybe a system where every bit is so large I can see it and access it using a pair of tweezers--but maybe I am fooling myself?
This hardly explains anything. It doesn't talk about the tags, how loads and stores actually work, and how much each kind of cache costs. On most Intel CPUs for example the L1 cache is physically larger than the L2 cache, despite being logically 16x smaller. It just costs more to make a faster cache. It has more wires and gates. So that's why we have different kinds of cache memory: money.
Are you including address translation in the area you're ascribing to the L1$ there? I haven't looked at detailed area numbers for recent Intel designs but having equal area for L1$ and L2$ seems really weird to me based on numbers from other designs I've seen.
I'm having a hard time mentally coming up with a way a larger L1$ could be faster. Have more ways, sure. Or more read ports. Or more bandwidth. And I'm given to understand that Intel tags (or used to tag) instruction boundaries in their L1I$. But how do you reduce latency? Physically larger transistors can more quickly overcome the capacitance of the lines to their destinations and the capacitance of those destinations, but a larger cache makes the distances and line capacitance correspondingly larger. You can have speed differences with 6T versus 8T SRAM cell designs, but as I understand it Intel went to 8T everywhere in Nehalem for energy efficiency reasons. I guess maybe changes in transistor technology could have made them revisit that, but 8 transistors isn't that much physically larger than 6.
But in general there are a lot of complicated things about how the memory subsystem of a CPU works that are important to performance and add a floor to how low a first-level cache's latency can be, but they don't really contradict anything that was said in the article.
> I'm having a hard time mentally come up with a way a larger L1$ could be faster.
It’s faster because you’re comparing two completely different physical implementations. SRAM does not use capacitors. DRAM does. You trade speed for density.
Except that L2 cache is also SRAM. In fact, usually all RAM inside the CPU is SRAM, because the process for making DRAM is not very compatible with the process for making SRAM and the rest of the CPU. We have only just recently been seeing devices which merge DRAM and a CPU into one package, but internally it is still on different chips (for example, the Apple M1 processor).
Also a few IBM designs. An interesting thing is that IBM claimed eDRAM was faster than SRAM for their particular design because, at the scale they were using it, the reduced size of the array increased speed more than eDRAM's access time decreased it.
That is interesting. I wonder if L1 is less dense because it has to have more bandwidth. But doesn't that point to a space constraint rather than money? A combination of L1 & L2 would have a bigger capacity, so wouldn't it be faster than pure L1 cache in the same space (for some/most workloads)?
I always thought cache layers were because of locality but that is my imagination :) The article talks about different access patterns of the cache layers, which makes sense in my mind.
It also mentions density briefly:
> Only what misses L1 needs to be passed on to higher cache levels, which don’t need to be nearly as fast, nor have as much bandwidth. They can worry more about power efficiency and density instead.
> doesn't that point to a space constraint rather than money?
The space constraints are also caused by money. The reason we don't just add more L1 cache is that it would take up a lot of space, necessitating larger dies, which lowers yields and significantly increases the cost per chip.
That isn't true at all. The limited speed at which a signal can propagate across a chip and the added levels of muxing necessarily mean there's a lower bound on a cache's latency that's roughly proportional to the square root of its capacity.
It actually is true. You're also right that physics would eventually constrain die size, but it isn't the bottleneck that's keeping CPUs at their typical current size. This should be pretty obvious from the existence of (rare and expensive) specialty CPU dies that are much larger than typical ones. They're clearly not physically impossible to produce, just very expensive. The current bottleneck holding back die sizes is in fact cost: on a larger die, each inevitable blemish ruins a larger chunk of your silicon wafer, cratering yields.
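As a rough illustration of that yield argument, here is a sketch using the simple Poisson yield model Y = exp(-D*A); the defect density and die areas are made-up numbers for illustration, not figures for any real process.

```c
/* Sketch: cost per good die grows much faster than die area once yield
 * losses are counted. Poisson yield model, with made-up numbers. */
#include <math.h>
#include <stdio.h>

int main(void) {
    const double d0 = 0.1;                          /* assumed defects per cm^2 */
    const double areas[] = { 1.0, 2.0, 4.0, 8.0 };  /* die areas in cm^2 */
    const double base_yield = exp(-d0 * areas[0]);

    for (int i = 0; i < 4; i++) {
        double yield = exp(-d0 * areas[i]);                  /* fraction of good dies */
        double rel_cost = (areas[i] / areas[0]) * (base_yield / yield);
        printf("die %4.1f cm^2: yield %5.1f%%, relative cost per good die %.2fx\n",
               areas[i], 100.0 * yield, rel_cost);
    }
    return 0;
}
```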
> added levels of muxing necessarily mean that there's a limit to how low the latency of a cache can be
L1 cache avoids muxing as much as possible, which is why it takes up so much die space in the first place.
The path of loading data from L1 is one of the tightest, most timing-critical parts of a modern CPU. Every cycle of latency here has a very clear, measurable impact on performance, and modern designs typically have a 4-5 cycle L1 load-to-use. Current AMD cores do really well against Intel ones despite clocking lower and being weaker on most types of resources simply because they have a 1-cycle advantage here. If you had literally infinite cheap transistors available, it would not be a good idea to spend them on the L1 cache, because this would make the CPU slower.
> L1 cache avoids muxing as much as possible, which is why it takes up so much die space in the first place.
Every time you double the size of a cache, you need to add an extra mux level on the access path, simply to select which half of the cache the result comes from. You also increase the distance that a signal needs to propagate, but I believe for L1 the muxes dominate.
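A trivial back-of-the-envelope sketch of that one-extra-mux-per-doubling rule of thumb; the reference size and the cache sizes are arbitrary examples.

```c
/* One extra 2:1 select level per doubling: count how many levels separate
 * a small L1-sized array from bigger caches. Sizes are arbitrary examples. */
#include <math.h>
#include <stdio.h>

int main(void) {
    const double base_kib = 32.0;  /* reference L1-ish capacity */
    const double sizes_kib[] = { 32, 64, 512, 2048, 32768 };

    for (int i = 0; i < 5; i++) {
        int extra_levels = (int)lround(log2(sizes_kib[i] / base_kib));
        printf("%8.0f KiB: %d extra mux level(s) on the access path vs %.0f KiB\n",
               sizes_kib[i], extra_levels, base_kib);
    }
    return 0;
}
```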
Big O notation is theoretical worst-case analysis of runtime or space that rarely/never maps to actual performance. It's nice to wax about, but what really matters is quality benchmark data from actual code run on real systems.
It's overvalued but still useful for the average person to have an easy way to think about if they're about to run 1000 operations or 1000^3 operations on something.
Well, that’s actually what the article talks about: it’s the theoretical best worst case, and it rarely/never maps to reality partly because of that.
We do the same things with decoupling capacitors. Rather than put all our store in one unit, we have multiple stores of differing size to buffer different load profiles.
The cost of bits at L2 and L3 is more or less the same. They are typically implemented using the same kind of transistors.
The reason is latency, plain and simple. Making a pool of memory larger always makes it slower to access, so to have a very fast pool you need it to be very small. The optimal solution is a hierarchical stack of pools of increasing size and latency. And if you had infinite free transistors, it would still be the same.
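A small average-memory-access-time sketch of why the hierarchy wins; all latencies and hit rates below are hypothetical round numbers, not measurements of any real part.

```c
/* Compare average access time of one big, slow pool vs. a small fast pool
 * backed by the same big pool. Uses the standard AMAT formula:
 * hit time + miss rate * miss penalty. All numbers are hypothetical. */
#include <stdio.h>

int main(void) {
    const double mem_ns = 100.0;                    /* backing memory latency */

    /* Option A: one large on-chip pool, slow because it is large. */
    const double big_pool_ns = 20.0, big_hit = 0.99;
    double amat_flat = big_pool_ns + (1.0 - big_hit) * mem_ns;

    /* Option B: a small fast L1 in front of the same large pool (as L2). */
    const double l1_ns = 1.0, l1_hit = 0.95;
    double amat_hier = l1_ns + (1.0 - l1_hit)
                     * (big_pool_ns + (1.0 - big_hit) * mem_ns);

    printf("single large pool : %.2f ns average access\n", amat_flat);
    printf("L1 + large L2     : %.2f ns average access\n", amat_hier);
    return 0;
}
```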
CPUs have multiple cache levels because the machine cycle at the CPU die is ~500ps, while writing to main memory and then reading it back at the same latency takes around 200ns, during which the CPU would sit idle.
To mask this, we write back to cache and rely on cache coherency algorithms and multiway, multilevel caches to make sure main memory is written back to and read when cache tags are invalidated.
tl;dr - Current process technologies make SRAM very much faster than DRAM, and multiple levels of multiway caches create a time-based interface that maximises memory throughput to the CPU registers while maintaining coherent memory write-backs.
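Using the ballpark figures from this comment, the gap the cache hierarchy is hiding works out to roughly:

```latex
\frac{200\ \text{ns (memory round trip)}}{0.5\ \text{ns per machine cycle}}
  \approx 400 \text{ cycles stalled per uncached access}
```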
It’s worth noting that Apple Silicon is fast because their DRAM bandwidth is much closer to the same machine cycle latency as the APU cores’ caches and registers.