The main point is this: for speed, L1 cache is limited to page size * associativity. This is what allows it to be virtually indexed (so you can do the TLB lookup and cache lookup in parallel) and physically tagged (so you don't need to invalidate the cache on every change to the page tables); but it means the index cannot use the bits that are translated by the page tables.
It's also why instruction and data caches are usually split at the L1 level. Keeping them coherent is a bit more complex, but splitting almost doubles the effective cache size. These days it also allows the icache to store pre-decoded instructions, but that's secondary.
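To make the page size * associativity limit concrete, here is a minimal sketch; the page size, line size, and associativity are typical x86-64 values chosen for illustration, not numbers from the comment above.

```c
/* Minimal sketch of the VIPT size limit, using typical x86-64 numbers. */
#include <stdio.h>

int main(void) {
    const unsigned page_size = 4096; /* only the low 12 address bits are untranslated */
    const unsigned line_size = 64;   /* bytes per cache line */
    const unsigned ways      = 8;    /* associativity */

    /* The set index must be built from page-offset bits, since those are the
     * only bits that are identical in the virtual and physical address. */
    const unsigned max_sets = page_size / line_size;        /* 64 sets */
    const unsigned max_size = max_sets * line_size * ways;  /* 32 KiB */

    printf("max VIPT L1 size = %u sets x %u B x %u ways = %u KiB\n",
           max_sets, line_size, ways, max_size / 1024);
    return 0;
}
```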
I'll add a few more reasons to split the L1 icache and dcache:
1) limiting pollution, so memcpy doesn't purge L1 of instructions
2) simplifies interface between CPU and icache - only need to support a handful of instruction fetch block sizes CPU actually uses
TL;DR: Time complexity of memory access scales O(√N), not O(1), where N is your data set size. This applies even without hitting theoretical information density limits, whatever your best process size and latency. For optimal performance on any particular task you'll always need to consider locality of your working data set, i.e. memory hierarchy, caching, etc. And to be clear, the implication is that temporal and spatial locality are fundamentally related.
Yes, some cache levels have lower intrinsic latency than others, but at scale (even at-home, DIY, toy project scales these days) it's simpler to think of that as a consequence of space-time physics, not a property of manufacturing processes, packaging techniques, or algorithms. This is liberating in the sense that you can predict and design against both today's and tomorrow's systems without fixating on specific technology or benchmarks.
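If you want to see that locality effect directly, a pointer-chasing microbenchmark along these lines shows the average access time stepping up as the working set falls out of each cache level; the sizes and iteration counts below are arbitrary choices for illustration, not taken from the comment.

```c
/* Rough pointer-chasing sketch: average latency of dependent loads vs.
 * working-set size. The random cycle defeats prefetching, so the numbers
 * mostly reflect whichever level of the hierarchy the set fits in. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double ns_per_access(size_t n_elems, size_t steps) {
    size_t *next = malloc(n_elems * sizeof *next);
    for (size_t i = 0; i < n_elems; i++) next[i] = i;
    /* Sattolo's algorithm: one random cycle through all elements. */
    for (size_t i = n_elems - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }
    size_t p = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t s = 0; s < steps; s++) p = next[p]; /* each load depends on the last */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    volatile size_t sink = p; (void)sink;           /* keep the loop from being elided */
    free(next);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / steps;
}

int main(void) {
    for (size_t kib = 16; kib <= 64 * 1024; kib *= 4) {
        size_t n = kib * 1024 / sizeof(size_t);
        printf("%6zu KiB working set: %.2f ns/access\n",
               kib, ns_per_access(n, 20u * 1000 * 1000));
    }
    return 0;
}
```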
No, it doesn't. Due to the Bekenstein bound, the amount of matter — and hence information — that can be stored in a sphere is ultimately proportional to surface area, not to volume. This is covered in part 2 of the article: https://www.ilikebigbits.com/2014_04_28_myth_of_ram_2.html
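Spelling out the scaling that follows from that (a quick derivation of my own, not quoted from the article):

```latex
\text{area-bounded storage: } N \propto r^{2}
  \;\Rightarrow\; r \propto \sqrt{N}
  \;\Rightarrow\; t_{\text{access}} \ge \tfrac{2r}{c} \propto \sqrt{N}
\qquad
\text{(volume-bounded packing, } N \propto r^{3}\text{, would give } t \propto \sqrt[3]{N}\text{)}
```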
This is only as you near the theoretical limit of density, though, right? At easy-to-achieve densities that we would be using in the foreseeable future, if we plotted their access times in these different configurations, wouldn't it--practically speaking--still scale at the cube root?
I guess that was covered in the comment.
> This applies even without hitting theoretical information density limits, whatever your best process size and latency.
But now I guess I don't understand why... :(. I feel like we can almost trivially show this isn't true by starting with a downright primitive "best process size"--maybe a system where every bit is so large I can see it and access it using a pair of tweezers--but maybe I am fooling myself?
This hardly explains anything. It doesn't talk about the tags, how loads and stores actually work, and how much each kind of cache costs. On most Intel CPUs for example the L1 cache is physically larger than the L2 cache, despite being logically 16x smaller. It just costs more to make a faster cache. It has more wires and gates. So that's why we have different kinds of cache memory: money.
Are you including address translation in the area you're ascribing to the L1$ there? I haven't looked at detailed area numbers for recent Intel designs but having equal area for L1$ and L2$ seems really weird to me based on numbers from other designs I've seen.
I'm having a hard time mentally coming up with a way a larger L1$ could be faster. Have more ways, sure. Or more read ports. Or more bandwidth. And I'm given to understand that Intel tags (or used to tag) instruction boundaries in their L1I$. But how do you reduce latency? Physically larger transistors can more quickly overcome the capacitance of the lines to their destinations and the capacitance of those destinations, but a larger cache makes the distances and line capacitance correspondingly larger. You can have speed differences with 6T versus 8T SRAM cell designs, but as I understand it Intel went to 8T everywhere in Nehalem for energy efficiency reasons. I guess maybe changes in transistor technology could have made them revisit that, but 8 transistors isn't that much physically larger than 6.
But in general there are a lot of complicated things about how the memory subsystem of a CPU works that are important to performance and add a floor to how low a first-level cache's latency can be, but they don't really contradict anything that was said in the article.
> I'm having a hard time mentally come up with a way a larger L1$ could be faster.
It’s faster because you’re comparing two completely different physical implementations. SRAM does not use capacitors. DRAM does. You trade speed for density.
Except that L2 cache is also SRAM. In fact, usually all RAM inside the CPU is SRAM, because the process for making DRAM is not very compatible with the process for making SRAM and the rest of the CPU. We have only just recently been seeing devices which merge DRAM and a CPU into one package, but internally it is still on different chips (for example, the Apple M1 processor).
Also a few IBM designs. An interesting thing is that IBM claimed eDRAM was faster than SRAM for their particular design because, at the scale they were using it, the reduced size of the array increased speed more than eDRAM's access time decreased it.
That is interesting. I wonder if L1 is less dense because it has to have more bandwidth. But doesn't that point to a space constraint rather than money? A combination of L1 & L2 would have a bigger capacity, so wouldn't it be faster than pure L1 cache in the same space (for some/most workloads)?
I always thought cache layers were because of locality but that is my imagination :) The article talks about different access patterns of the cache layers, which makes sense in my mind.
It also mentions density briefly:
> Only what misses L1 needs to be passed on to higher cache levels, which don’t need to be nearly as fast, nor have as much bandwidth. They can worry more about power efficiency and density instead.
> doesn't that point to a space constraint rather than money?
The space constraints are also caused by money. The reason we don't just add more L1 cache is that it would take up a lot of space, necessitating larger dies, which lowers yields and significantly increases the cost per chip.
That isn't true at all. The limited speed at which a signal can propagate across a chip and the added levels of muxing necessarily mean there's a lower bound on a cache's latency that's roughly proportional to the square root of its capacity.
It actually is true. You're also right that physics would eventually constrain die size, but it isn't the bottleneck that's keeping CPUs at their typical current size. This should be pretty obvious from the existence of (rare and expensive) specialty CPU dies that are much larger than typical ones. They're clearly not physically impossible to produce, just very expensive. The current bottleneck holding back die sizes is in fact cost: on a larger die, each inevitable blemish ruins a larger chunk of your silicon wafer, cratering yields.
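As a rough illustration of that yield argument, here is a sketch using the simple Poisson yield model Y = exp(-D*A); the defect density and die areas are made-up numbers for illustration, not figures for any real process.

```c
/* Sketch: cost per good die grows much faster than die area once yield
 * losses are counted. Poisson yield model, with made-up numbers. */
#include <math.h>
#include <stdio.h>

int main(void) {
    const double d0 = 0.1;                          /* assumed defects per cm^2 */
    const double areas[] = { 1.0, 2.0, 4.0, 8.0 };  /* die areas in cm^2 */
    const double base_yield = exp(-d0 * areas[0]);

    for (int i = 0; i < 4; i++) {
        double yield = exp(-d0 * areas[i]);                  /* fraction of good dies */
        double rel_cost = (areas[i] / areas[0]) * (base_yield / yield);
        printf("die %4.1f cm^2: yield %5.1f%%, relative cost per good die %.2fx\n",
               areas[i], 100.0 * yield, rel_cost);
    }
    return 0;
}
```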
> added levels of muxing necessarily mean that there's a limit to how low the latency of a cache can be
L1 cache avoids muxing as much as possible, which is why it takes up so much die space in the first place.
The path of loading data from L1 is one of the tightest, most timing-critical parts of a modern CPU. Every cycle of latency here has a very clear, measurable impact on performance, and modern designs typically have a 4-5 cycle L1 load-to-use. Current AMD cores do really well against Intel ones despite clocking lower and being weaker on most types of resources simply because they have a 1-cycle advantage here. If you had literally infinite cheap transistors available, it would not be a good idea to spend them on the L1 cache, because this would make the CPU slower.
> L1 cache avoids muxing as much as possible, which is why it takes up so much die space in the first place.
Every time you double the size of a cache, you need to add an extra mux level on the access path, simply to select which half of the cache the result comes from. You also increase the distance that a signal needs to propagate, but I believe for L1 the muxes dominate.
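A trivial back-of-the-envelope sketch of that one-extra-mux-per-doubling rule of thumb; the reference size and the cache sizes are arbitrary examples.

```c
/* One extra 2:1 select level per doubling: count how many levels separate
 * a small L1-sized array from bigger caches. Sizes are arbitrary examples. */
#include <math.h>
#include <stdio.h>

int main(void) {
    const double base_kib = 32.0;  /* reference L1-ish capacity */
    const double sizes_kib[] = { 32, 64, 512, 2048, 32768 };

    for (int i = 0; i < 5; i++) {
        int extra_levels = (int)lround(log2(sizes_kib[i] / base_kib));
        printf("%8.0f KiB: %d extra mux level(s) on the access path vs %.0f KiB\n",
               sizes_kib[i], extra_levels, base_kib);
    }
    return 0;
}
```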
Big O notation is theoretical worst-case analysis of runtime or space that rarely/never maps to actual performance. It's nice to wax about, but what really matters is quality benchmark data from actual code run on real systems.
It's overvalued but still useful for the average person to have an easy way to think about if they're about to run 1000 operations or 1000^3 operations on something.
Well, that’s actually what the article talks about: it’s the theoretical best worst case, and it rarely/never maps to reality partly because of that.
We do the same things with decoupling capacitors. Rather than put all our store in one unit, we have multiple stores of differing size to buffer different load profiles.
The cost of bits at L2 and L3 is more or less the same. They are typically implemented using the same kind of transistors.
The reason is latency, plain and simple. Making a pool of memory larger always makes it slower to access, so to have a very fast pool you need it to be very small. The optimal solution is a hierarchical stack of pools of increasing size and latency. And if you had infinite free transistors, it would still be the same.
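A small average-memory-access-time sketch of why the hierarchy wins; all latencies and hit rates below are hypothetical round numbers, not measurements of any real part.

```c
/* Compare average access time of one big, slow pool vs. a small fast pool
 * backed by the same big pool. Uses the standard AMAT formula:
 * hit time + miss rate * miss penalty. All numbers are hypothetical. */
#include <stdio.h>

int main(void) {
    const double mem_ns = 100.0;                    /* backing memory latency */

    /* Option A: one large on-chip pool, slow because it is large. */
    const double big_pool_ns = 20.0, big_hit = 0.99;
    double amat_flat = big_pool_ns + (1.0 - big_hit) * mem_ns;

    /* Option B: a small fast L1 in front of the same large pool (as L2). */
    const double l1_ns = 1.0, l1_hit = 0.95;
    double amat_hier = l1_ns + (1.0 - l1_hit)
                     * (big_pool_ns + (1.0 - big_hit) * mem_ns);

    printf("single large pool : %.2f ns average access\n", amat_flat);
    printf("L1 + large L2     : %.2f ns average access\n", amat_hier);
    return 0;
}
```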
CPUs have multiple cache levels because the machine cycle at the CPU die is ~500ps, while writing to main memory and then reading it back at the same latency takes around 200ns, during which the CPU would sit idle.
To mask this, we write back to cache and rely on cache coherency algorithms and multiway, multilevel caches to make sure main memory is written back to and read when cache tags are invalidated.
tl;dr - Current process technologies make SRAM very much faster than DRAM, and multiple levels of multiway caches create a time-based interface that maximises memory throughput to the CPU registers while maintaining coherent memory write-backs.
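Using the ballpark figures from this comment, the gap the cache hierarchy is hiding works out to roughly:

```latex
\frac{200\ \text{ns (memory round trip)}}{0.5\ \text{ns per machine cycle}}
  \approx 400 \text{ cycles stalled per uncached access}
```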
It’s worth noting that Apple Silicon is fast because their DRAM bandwidth is much closer to the same machine cycle latency as the APU cores’ caches and registers.