I don't think the numbers sufficiently capture the limitation. The Intel memory bandwidth you quoted would apply to CPU-based inference, but not to GPU inference that uses shared system memory to spill the parts of the model that don't fit in dedicated GPU VRAM. I think that would necessarily limit parts of the inference procedure (not sure how the split would work, and it would probably depend on whether you're using something like flash attention or not) to the available PCIe 3.0 or 4.0 bandwidth, since the GPU has to go over the PCIe bus and then the chipset memory bus to reach system RAM.
A GPU connected to a PCIe 3.0 x16 electrical uplink would be constrained to ~16 GB/s, or ~32 GB/s on a PCIe 4.0 uplink. Although those numbers are slower than the CPU's memory bandwidth, that bottleneck only applies when paging in or out (or directly accessing?) the layers that have overflowed into shared system RAM, so they don't mean much on their own.
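To make that concrete, here's a rough back-of-the-envelope sketch. The model size, quantization, and layer split are assumptions for illustration, not measurements; the point is just that for bandwidth-bound decoding, every weight sitting in system RAM has to cross the PCIe link once per token:

```python
# Back-of-the-envelope: time per decode step when part of the model
# spills past VRAM and must be streamed over PCIe for every token.
# All figures are illustrative assumptions, not measurements.

MODEL_BYTES = 35e9                 # e.g. a 70B model at ~4-bit quantization
VRAM_BYTES  = 24e9                 # 24 GB card
VRAM_BW     = 1008e9               # on-card GDDR6X bandwidth (4090)
PCIE_BW     = {"PCIe 3.0 x16": 16e9, "PCIe 4.0 x16": 32e9}

in_vram = min(MODEL_BYTES, VRAM_BYTES)
spilled = MODEL_BYTES - in_vram

for link, bw in PCIE_BW.items():
    # Each token touches every weight once: resident weights stream at
    # VRAM speed, spilled weights at PCIe speed.
    t = in_vram / VRAM_BW + spilled / bw
    print(f"{link}: ~{1.0 / t:.1f} tokens/s (weights-only estimate)")
```

Even with most of the weights resident in VRAM, the spilled slice dominates because the PCIe link is ~30-60x slower than the card's own memory.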
I've seen quad 4090 builds, e.g. here[0]. What do you mean no more than two cards? Yes, power is definitely an issue with multiple 4090s, though you can limit the max power using `nvidia-smi`, which IME doesn't hurt (mem-bottlenecked) inference.
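For what it's worth, capping power is a one-liner per card. A minimal sketch, assuming four cards and a 300 W cap (both values are illustrative, and setting the limit needs root):

```python
import subprocess

# Cap each card's power limit via nvidia-smi (requires root).
# NUM_GPUS and POWER_LIMIT_W are illustrative; check your card's
# supported range with `nvidia-smi -q -d POWER` first.
NUM_GPUS = 4
POWER_LIMIT_W = 300

for i in range(NUM_GPUS):
    subprocess.run(
        ["nvidia-smi", "-i", str(i), "-pl", str(POWER_LIMIT_W)],
        check=True,
    )
```

The setting doesn't persist across reboots, so it's worth re-applying it from a startup script if you rely on it.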
For memory bandwidth at the lower tiers, yeah. The M3 Max still has 400 GB/s, and since the M2 Ultra (800 GB/s) is just two M2 Maxes glued together (400 GB/s each), the eventual M3 Ultra should be comparable.
Intel Core i9-13900F memory bandwidth: 89.6 GB/s, memory size up to 192 GB
Apple M3 Pro memory bandwidth: 150 GB/s, memory size up to 36 GB
Apple M3 Max memory bandwidth: 300 GB/s, memory size up to 128 GB
GeForce RTX 4090 memory bandwidth: 1008 GB/s, memory size 24 GB fixed, no more than two cards per PC.
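To put those bandwidth figures in inference terms: for single-stream decoding, the full set of weights is read roughly once per generated token, so bandwidth divided by model size gives an optimistic upper bound on tokens/s. A quick sketch using the numbers above and an assumed ~20 GB quantized model that fits on every platform listed (KV-cache traffic and compute are ignored):

```python
# Optimistic upper bound on single-stream decode speed: each token
# reads every weight once, so tokens/s <= bandwidth / model size.
# Assumes a ~20 GB quantized model; ignores KV cache and compute.

MODEL_BYTES = 20e9

platforms = {
    "Core i9-13900F (DDR5)": 89.6e9,
    "Apple M3 Pro":          150e9,
    "Apple M3 Max":          300e9,
    "RTX 4090 (GDDR6X)":     1008e9,
}

for name, bw in platforms.items():
    print(f"{name:22s} ~{bw / MODEL_BYTES:5.1f} tokens/s upper bound")
```

By that measure the 4090's ~1 TB/s is the standout, but only as long as the model actually fits in its 24 GB.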