Are you referring to latency due to propagation delay, where the worst case increases as you scale?
Would you mind elaborating a bit? I'm not following how this would significantly close the gap between SRAM and DRAM at 1GB. An SRAM cell itself is generally faster than a DRAM cell, and my understanding is that the circuitry beyond the SRAM cell is far simpler than DRAM's. Am I missing something?
Think of a circular library with a central atrium and bookshelves arranged in circles radiating out from the atrium. In the middle of the atrium you have your circular desk. You can put books on your desk to save yourself the trouble of having to go get them off the shelves. You can also move books to shelves that are closer to the atrium so they're quicker to get than the ones farther away.
So what's the problem? Well, your desk is the fastest place to get books from, but you clearly can't make your desk the size of the entire library; that would defeat the purpose. You also can't move all of the books to the innermost ring of shelves, since they won't fit: the closer a ring is to the central atrium, the less shelf space it has, and the farther out, the more.
Circuits don't follow this ideal model of concentric rings, but I think it's a nice rough approximation for what's happening here. It's a problem of geometry, not a problem of physics, and so the limitation is even more fundamental than the laws of physics. You could improve things by going to 3 dimensions, but then you would have to think about how to navigate a spherical library, and so the analogy gets stretched a bit.
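To put rough numbers on the geometry, here's a toy model (not a circuit simulation; the density and signal-speed constants are made up) of how much "library" you can reach within a given time budget on a plane:

```python
import math

# Toy model: memory cells tiled uniformly around a central "desk" (core).
# Capacity reachable within a round-trip time t scales with the area you
# can cover, i.e. with t^2 in 2D (it would be t^3 in 3D).
CELLS_PER_MM2 = 1e6        # assumed planar cell density (made up)
SIGNAL_MM_PER_NS = 150.0   # assumed signal speed, ~half the speed of light

def reachable_capacity_2d(t_ns: float) -> float:
    """Cells within one-way radius r = (t/2) * v on a plane."""
    r_mm = (t_ns / 2) * SIGNAL_MM_PER_NS
    return math.pi * r_mm**2 * CELLS_PER_MM2

for t in (1.0, 2.0, 4.0):
    print(f"{t:.0f} ns budget -> ~{reachable_capacity_2d(t):.2e} cells")
# Doubling the time budget quadruples the reachable capacity in 2D.
```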
Area is a big one. Why isn't L1 a megabyte or more? Because you can't put that much data close enough to the core.
Look at a Zen-based EPYC core: 32KB of L1 with 4-cycle latency, 512KB of L2 with 12-cycle latency, 8MB of L3 with 37-cycle latency.
L1 to L2 is 3x slower for 16x more memory; L2 to L3 is 3x slower for 16x more memory.
A signal can travel 3x farther in 3x more cycles, which covers about 9x more area, so you can see how cache capacity scaling is roughly quadratic in latency (there's a lot of execution machinery competing with L1/L2 for nearby area, so it's not exact).
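To make the arithmetic explicit (the sizes and latencies are the ones quoted above; the "quadratic" model is just capacity ∝ latency²), here's a quick sketch:

```python
# Cache levels quoted above: (size in KB, latency in cycles).
levels = {"L1": (32, 4), "L2": (512, 12), "L3": (8192, 37)}

for a, b in (("L1", "L2"), ("L2", "L3")):
    size_ratio = levels[b][0] / levels[a][0]
    lat_ratio = levels[b][1] / levels[a][1]
    # If latency grows with the square root of capacity, a latency ratio k
    # should buy roughly k^2 more capacity.
    print(f"{a}->{b}: {size_ratio:.0f}x capacity for {lat_ratio:.1f}x latency "
          f"(quadratic model predicts ~{lat_ratio**2:.0f}x)")
# Prints 16x capacity for ~3x latency at each step, against a predicted
# 9-10x: the same ballpark, hence "basically quadratic".
```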
I am sure there are many factors, but the most basic one is that the more memory you have, the longer it takes to address that memory. I think it scales with the log of the RAM size, i.e., linearly with the number of address bits.
Log-depth circuits are a useful abstraction, but the constraints of laying out circuits in physical space impose a delay scaling limit of O(n^(1/2)) for planar circuits (with a bounded number of layers) and O(n^(1/3)) for 3D circuits. The problem should be familiar to anyone who's drawn a binary tree on paper: the tree is only log-deep, but the leaves spread out linearly, so the wires near the bottom get long.
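To make that concrete, here's a toy comparison of the two delay terms for an n-word memory: decoder logic depth grows like log2(n), while the wire length to reach cells laid out on a plane grows like sqrt(n). The constants are purely illustrative:

```python
import math

# Compare the two delay terms for an n-word memory:
#  - logic depth of the address decoder: ~log2(n) gate delays
#  - wire length to reach cells tiled on a plane: ~sqrt(n) cell pitches
for n in (2**10, 2**20, 2**30):
    print(f"n = 2^{int(math.log2(n))}: "
          f"decoder depth ~{math.log2(n):.0f} gates, "
          f"planar wire length ~{math.sqrt(n):.0f} pitches, "
          f"3D wire length ~{n ** (1/3):.0f} pitches")
# The sqrt(n) wiring term dwarfs the log(n) decode term at large n,
# which is why physical layout, not addressing logic, sets the limit.
```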
With densities so high, and circuit boards so small (when they want to be), that factor isn't very important here.
We regularly use chips with an L3 latency around 10 nanoseconds, covering distances of about 1.5 centimeters, and only a small fraction of a nanosecond of that is propagation delay. So let's say we wanted to expand sideways, with a budget of only 1 or 2 nanoseconds for round-trip propagation delay. Even with a relatively pessimistic assumption of signals traveling at half the speed of light, that's a diameter of 15cm or 30cm to fit our SRAM into. That's enormous.
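That back-of-envelope in code form (assuming, as above, signals at half the speed of light and the budget covering a round trip):

```python
# Round-trip propagation budget -> how far out the SRAM can sit.
C_CM_PER_NS = 30.0      # speed of light, ~30 cm/ns
v = 0.5 * C_CM_PER_NS   # pessimistic: signals at half of c

for budget_ns in (1.0, 2.0):
    one_way_cm = v * (budget_ns / 2)   # half the budget each direction
    print(f"{budget_ns:.0f} ns round trip -> radius {one_way_cm:.1f} cm, "
          f"diameter {2 * one_way_cm:.0f} cm")
# 1 ns -> 15 cm diameter, 2 ns -> 30 cm diameter: plenty of room.
```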
So RAM is your 1GB cache.