> The operation GP is thinking of is a full scan, and that will always take Ω(n · ∛n) lower bound time.
It doesn't. Full scans are faster than accessing each memory address in an unordered way.
Let's look at a Ryzen 2600X. You can sustain 32 bytes per cycle from L1, 32 bytes per cycle from L2, and 20 bytes per cycle from L3. That's 64KB, 512KB, and 16MB caches all having almost the same bandwidth despite very different latencies.
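A toy way to see the bandwidth-vs-latency split (my own illustrative benchmark, not numbers from the Ryzen spec): read the same data twice, once in order and once in a shuffled order. Both passes do identical work, but the sequential pass keeps prefetching and cache lines effective, so it typically runs noticeably faster.

```python
import random
import time

N = 2_000_000
data = list(range(N))
order = list(range(N))
random.shuffle(order)

# Sequential scan: access pattern is predictable, bandwidth-bound.
t0 = time.perf_counter()
seq_sum = 0
for i in range(N):
    seq_sum += data[i]
t_seq = time.perf_counter() - t0

# Same reads in shuffled order: each access is a surprise, latency-bound.
t0 = time.perf_counter()
rand_sum = 0
for i in order:
    rand_sum += data[i]
t_rand = time.perf_counter() - t0

print(f"sequential: {t_seq:.3f}s  shuffled: {t_rand:.3f}s")
# Both passes read every element exactly once, so the sums match.
assert seq_sum == rand_sum == N * (N - 1) // 2
```

The exact ratio depends heavily on the machine and on Python's own overhead; the point is only that the same n reads cost different wall-clock time depending on access order.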
You can also imagine an InfiniBand network that fills 2 racks, and another one that fills 50,000 racks. The bandwidth of a single node is the same in both situations, so even though latency gets worse as you add more nodes and hops, it still takes O(n) time for a single thread to scan the entire memory.
You can find correlations between memory size and bandwidth, but they're significantly weaker and less consistent than the correlations between memory size and latency.
Once you start receiving data in bulk, the time it takes is quantity of data divided by your connection speed. Latency doesn't factor in.
Technically you need to consider the time it takes to start receiving data. Which would mean your total time is O(n + ∛n). Not O(n * ∛n). But not all nodes are ∛n away. The closest nodes are O(1) latency. So if you start your scan on the close nodes, you will keep your data link saturated from the start, and your total time will be O(n). (And O(n + ∛n) simplifies to O(n) anyway.)
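The distinction between O(n + ∛n) and O(n · ∛n) comes down to whether the latency is paid once (pipelined streaming) or once per element. A minimal sketch of that cost model, with made-up bandwidth and latency-scaling numbers purely for illustration:

```python
def scan_time_pipelined(n, bandwidth, latency):
    # Pay the worst-case latency once to start the stream, then the
    # link stays saturated: latency + n/bandwidth, i.e. O(n + cbrt(n)).
    return latency + n / bandwidth

def scan_time_per_access(n, bandwidth, latency):
    # Pessimistic model: every element pays a full round trip,
    # giving the O(n * cbrt(n)) behavior GP has in mind.
    return n * latency + n / bandwidth

n = 10**9
bw = 10**8                 # elements per second (assumed)
lat = n ** (1 / 3) * 1e-9  # latency growing as cbrt(n) (assumed scaling)

print(scan_time_pipelined(n, bw, lat))   # dominated by n/bandwidth
print(scan_time_per_access(n, bw, lat))  # dominated by the n*latency term
```

With these (invented) constants the pipelined scan is essentially n/bandwidth, while the per-access model is two orders of magnitude slower: the ∛n latency only matters if you pay it n times.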