You usually don't even need AVX-512 to sustain enough loads/stores at the core to max out memory bandwidth "in theory": even with 256-bit loads, and assuming 2 loads/1 store per cycle (ICL/Zen 3 and newer can do more), that's 256 GB/s of read bandwidth or 128 GB/s of write bandwidth (or both, at once!) at 4 GHz.
Indeed, you can reach these numbers if you always hit in L1, and come close if you always hit in L2. The load number especially is higher than almost any single-socket memory bandwidth until perhaps very recently*: an 8-channel chip with the fastest DDR4-3200 would get 25.6 x 8 = 204.8 GB/s of max theoretical bandwidth. Most chips have fewer channels and lower max theoretical bandwidth.
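To make the back-of-envelope arithmetic explicit, here it is as a tiny C sketch. The 4 GHz clock, 2 loads + 1 store per cycle, 256-bit registers, and 8x DDR4-3200 are the illustrative assumptions from the paragraphs above, not the figures of any particular chip:

    #include <stdio.h>

    int main(void) {
        /* Core side: 256-bit (32-byte) vector accesses at 4 GHz. */
        double ghz = 4.0, vec_bytes = 32.0;
        double load_bw  = 2.0 * vec_bytes * ghz;  /* 2 loads/cycle -> 256 GB/s */
        double store_bw = 1.0 * vec_bytes * ghz;  /* 1 store/cycle -> 128 GB/s */

        /* Memory side: DDR4-3200 is 3200 MT/s over a 64-bit (8-byte) bus. */
        double per_channel = 3.2 * 8.0;           /* 25.6 GB/s per channel    */
        double socket_bw   = 8.0 * per_channel;   /* 8 channels -> 204.8 GB/s */

        printf("core load BW:   %.1f GB/s\n", load_bw);
        printf("core store BW:  %.1f GB/s\n", store_bw);
        printf("socket DRAM BW: %.1f GB/s\n", socket_bw);
        return 0;
    }

So on paper, a single core's load ports can outrun an entire 8-channel DDR4 socket.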
However, and as a sibling comment alludes to, you generally cannot in practice sustain enough outstanding misses from a single core to actually achieve this number. E.g., with 16 outstanding misses and a 100 ns latency per cache line, you can only demand fetch at ~10 GB/s from one core. Actual numbers are higher due to prefetching, which both decreases the effective latency (since the prefetch is initiated from a component closer to the memory controller) and makes more outstanding misses available (since there are more miss buffers at the L2 than at the core), but this only roughly doubles the bandwidth: it's hard to get more than 20-30 GB/s from a single core on Intel.
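Here is the same latency-concurrency estimate (Little's law) as a sketch, using the illustrative numbers above (16 outstanding line fills, 64-byte lines, 100 ns memory latency):

    #include <stdio.h>

    int main(void) {
        double line_bytes = 64.0;   /* cache line size                        */
        double latency_ns = 100.0;  /* memory latency as seen by the core     */
        double in_flight  = 16.0;   /* outstanding misses (core fill buffers) */

        /* bytes per nanosecond is numerically GB/s */
        double demand_bw = in_flight * line_bytes / latency_ns;
        printf("demand-fetch BW: ~%.1f GB/s\n", demand_bw);  /* ~10.2 */

        /* Concurrency needed to saturate the 204.8 GB/s socket figure above: */
        double needed = 204.8 * latency_ns / line_bytes;
        printf("lines in flight needed: ~%.0f\n", needed);   /* ~320 */
        return 0;
    }

With roughly 320 lines needed in flight at that latency, it's clear why prefetching from the L2 (more buffers, lower effective latency) helps but only roughly doubles what one core can pull.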
This isn't a fundamental limitation that applies to every CPU, however: Apple chips can extract the entire bandwidth from a single core, despite having narrower 128-bit (perhaps 256-bit if you count load-pair) load and store instructions.
---
* Not really sure about this one: are there 16-channel DDR5 setups out there yet? (16 DDR5 channels correspond to 8 independent DIMMs, so it's similar to an 8-channel DDR4 setup, as DDR5 has 2x channels per DIMM.)