Don't forget that good cache locality can also get data pulled into cache that the prefetcher knew nothing about.
I can build you a shitty linked list that fits perfectly in L3 but still has terrible cold-cache performance, because each individual cache line has to be pulled in one by one.
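
To make that concrete, here is a rough sketch of the kind of list I mean, with a flat array next to it for comparison. All the names (`Node`, `link_shuffled`, `sum_list`, `sum_array`) are made up for illustration, and it only builds the two layouts; it does not measure anything. The nodes live in one contiguous pool but are linked in a shuffled order, so each hop most likely lands on a different cache line and the next address is only known once the current node has been loaded.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical node type: 16 bytes on a typical 64-bit target.
struct Node {
    std::uint64_t value;
    Node* next;
};

// Links the nodes of `pool` in a pseudo-random order and returns the head.
// The pool itself is contiguous, but the traversal order is not, so
// consecutive hops tend to touch unrelated cache lines.
Node* link_shuffled(std::vector<Node>& pool, unsigned seed) {
    assert(!pool.empty());
    std::vector<std::size_t> order(pool.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::mt19937 rng(seed);
    std::shuffle(order.begin(), order.end(), rng);
    for (std::size_t i = 0; i + 1 < order.size(); ++i) {
        pool[order[i]].next = &pool[order[i + 1]];
    }
    pool[order.back()].next = nullptr;
    return &pool[order.front()];
}

// Pointer chasing: every load depends on the previous one.
std::uint64_t sum_list(const Node* head) {
    std::uint64_t total = 0;
    for (const Node* n = head; n != nullptr; n = n->next) {
        total += n->value;
    }
    return total;
}

// Sequential walk over contiguous memory: addresses are predictable,
// and each fetched cache line carries several upcoming elements.
std::uint64_t sum_array(const std::vector<std::uint64_t>& values) {
    std::uint64_t total = 0;
    for (std::uint64_t v : values) {
        total += v;
    }
    return total;
}
```

Both functions compute the same sum over the same number of elements. If you want to see the difference, size the pool to a few megabytes, evict it from cache, and time one pass of each; the exact numbers depend heavily on the CPU, so I am not going to quote any.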
When the working set is measured in megabytes, it fits in the L3 cache of modern CPUs. Memory layout is not too important for these programs.
> most GC'd languages that aren't java have crappy GCs
C# is good too. It also has value types, native memory spans, SIMD intrinsics, and stackalloc.