
This answer by Peter Cordes is much more complete: https://stackoverflow.com/a/38549736



The main point is this: for speed, the L1 cache is limited to page size * associativity. This is what allows it to be virtually indexed (so you can do the TLB lookup and cache lookup in parallel) and physically tagged (so you don't need to invalidate the cache on every change to the page tables); but it means the index cannot use the bits that are translated by the page tables.
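To make that concrete, here's a tiny sketch of the arithmetic (the 4 KiB page, 64-byte line, and 8-way figures are my assumed example parameters, not from the comment, though they match many x86 L1d designs):

    #include <stdio.h>

    int main(void) {
        /* With a virtually indexed, physically tagged (VIPT) L1, the set index
           must come entirely from the page-offset bits, which are identical in
           the virtual and physical address. */
        unsigned page_size = 4096;  /* assumed 4 KiB pages */
        unsigned line_size = 64;    /* assumed 64-byte cache lines */
        unsigned ways      = 8;     /* assumed 8-way set associativity */

        unsigned sets   = page_size / line_size;   /* 64 sets -> 6 index bits, all inside the page offset */
        unsigned max_l1 = sets * line_size * ways; /* = page_size * ways */

        printf("max VIPT L1 size: %u KiB\n", max_l1 / 1024); /* prints 32 KiB */
        return 0;
    }

With 4 KiB pages that works out to the familiar 32 KiB L1d; going bigger means larger pages, higher associativity, or giving up the parallel TLB/cache lookup.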

It's also why instruction and data caches are usually split at the L1 level. Keeping them coherent is a bit more complex, but it almost doubles the available L1 capacity. These days it also lets the icache store pre-decoded instructions, but that's secondary.


I'll add a few more reasons to split the L1 icache and dcache:

1) Limiting pollution, so memcpy doesn't purge L1 of instructions.

2) Simplifying the interface between the CPU and the icache - it only needs to support the handful of instruction-fetch block sizes the CPU actually uses.


No answer is complete without getting into fundamental physics, information theory, and black holes: https://www.ilikebigbits.com/2014_04_21_myth_of_ram_1.html

TL;DR: The time complexity of memory access scales as O(√N), not O(1), where N is the size of your data set. This applies even before you hit theoretical information-density limits, no matter how good your process node and device latency are. For optimal performance on any particular task you'll always need to consider the locality of your working set, i.e. memory hierarchy, caching, etc. And to be clear, the implication is that temporal and spatial locality are fundamentally related.

Yes, some cache levels have lower intrinsic latency than others, but at scale (even at-home, DIY, toy project scales these days) it's simpler to think of that as a consequence of space-time physics, not a property of manufacturing processes, packaging techniques, or algorithms. This is liberating in the sense that you can predict and design against both today's and tomorrow's systems without fixating on specific technology or benchmarks.
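A minimal pointer-chasing sketch of the effect (the sizes, the single-cycle shuffle, and the timing approach are my own assumptions, not from the article): chase a random cycle through progressively larger arrays and watch the nanoseconds per dependent load climb as the working set falls out of each level of the hierarchy.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Crude xorshift64 PRNG; good enough for shuffling in a sketch. */
    static uint64_t rng = 88172645463325252ull;
    static uint64_t xorshift64(void) {
        rng ^= rng << 13; rng ^= rng >> 7; rng ^= rng << 17;
        return rng;
    }

    /* Average nanoseconds per dependent load when chasing a random
       single-cycle permutation of n indices. */
    static double chase_ns(size_t n, size_t steps) {
        size_t *next = malloc(n * sizeof *next);
        for (size_t i = 0; i < n; i++) next[i] = i;
        /* Sattolo's algorithm: yields one cycle of length n, so the chase
           really touches the whole working set and defeats the prefetcher. */
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = xorshift64() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }
        struct timespec t0, t1;
        volatile size_t p = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t s = 0; s < steps; s++) p = next[p];
        clock_gettime(CLOCK_MONOTONIC, &t1);
        free(next);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
        return ns / (double)steps;
    }

    int main(void) {
        /* Working sets from 8 KiB to 128 MiB of index data. */
        for (size_t n = 1024; n <= (size_t)1 << 24; n *= 4)
            printf("%9zu elements (%7zu KiB): %6.2f ns/load\n",
                   n, n * sizeof(size_t) / 1024, chase_ns(n, 5u * 1000 * 1000));
        return 0;
    }

Compile with something like cc -O2 chase.c on a POSIX system; the exact numbers depend on your machine, but the staircase shape as you cross each cache size is the point.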


√N because we’re not yet 3D. Move to a sphere and it improves to cube root.


> Move to a sphere and it improves to cube root.

No, it doesn't. Due to the Bekenstein bound, the amount of matter — and hence information — that can be stored in a sphere is ultimately proportional to surface area, not to volume. This is covered in part 2 of the article: https://www.ilikebigbits.com/2014_04_28_myth_of_ram_2.html
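Spelling out the step that's left implicit here (my own sketch, following part 2 of the article): if the information reachable within radius r grows only with surface area, then the radius, and therefore the round-trip signal time at a bounded signal speed, grows as the square root of the amount of data.

    % Holographic/Bekenstein-style scaling sketch (assumes signal speed bounded by c)
    N \;\propto\; A \;\propto\; r^{2}
    \quad\Longrightarrow\quad
    r \;\propto\; \sqrt{N}
    \quad\Longrightarrow\quad
    T_{\text{access}} \;\propto\; \frac{r}{c} \;\propto\; \sqrt{N}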


This is only as you near the theoretical limit of density, though, right? At easy-to-achieve densities that we would be using for the foreseeable future, if we plotted their access times in these different configurations, wouldn't it--practically speaking--still scale as the cube root?

I guess that was covered in the comment.

> This applies even without hitting theoretical information density limits, whatever your best process size and latency.

But now I guess I don't understand why... :(. I feel like we can almost trivially show this isn't true by starting with a downright primitive "best process size"--maybe a system where every bit is so large I can see it and access it using a pair of tweezers--but maybe I am fooling myself?



