Yeah, sorry, I realized that as well, so I edited my post to add a higher-end example with multiple 3090s or similar cards. A single 3090 has just under 1 TB/s of memory bandwidth (936 GB/s).
One more edit: I'd also like to point out that memory bandwidth is necessary, but not sufficient, for fast inference. My entire point here is that Apple silicon does have high memory bandwidth, but for inference it's very much held back by the relative slowness of its GPU compared with dedicated Nvidia/AMD cards.
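To make that concrete, here's a rough back-of-envelope sketch. The specific figures (weight size for a 4-bit 70B model, M2 Ultra bandwidth and compute) are my own ballpark assumptions, not measurements: token generation is roughly bounded by how fast the weights stream out of memory, while prompt evaluation is roughly bounded by raw GPU compute, which is where Apple silicon lags.

    # Back-of-envelope only; all constants are assumed ballpark figures.
    weights_gb = 40.0        # assumed: ~70B model at 4-bit quantization
    bandwidth_gbs = 800.0    # assumed: M2 Ultra unified memory, ~800 GB/s
    params = 70e9
    gpu_tflops = 27.0        # assumed: M2 Ultra GPU compute, rough ballpark

    # Decode: every generated token re-reads the full weights once,
    # so bandwidth sets the ceiling.
    decode_tok_s = bandwidth_gbs / weights_gb            # ~20 tok/s ceiling

    # Prefill: roughly 2 * params FLOPs per prompt token,
    # so GPU compute sets the ceiling.
    prefill_tok_s = (gpu_tflops * 1e12) / (2 * params)   # ~190 tok/s ceiling

    print(f"decode ceiling  ~{decode_tok_s:.0f} tok/s")
    print(f"prefill ceiling ~{prefill_tok_s:.0f} tok/s")

The point of the sketch is just the shape of the bound: the decode ceiling scales with bandwidth, the prefill ceiling scales with compute, and a Mac is far closer to a 3090 on the former than the latter.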
It's still "fast enough" for even 120b models in practice, and you don't need to muck around with building a multi-GPU rig (and figuring out how to e.g. cool it properly).
It's definitely not what you'd want for your data center, but for home tinkering it has a very clear niche.
> It's still "fast enough" for even 120b models in practice
Is it? This is very subjective. The Mac Studio would not be "fast enough" for me on even a 70b model, not so much because its token generation is slow, but because its prompt evaluation speed is quite bad. See [0] for example numbers: on Llama 3 70B at Q4_K_M quantization, it takes an M2 Ultra with 192GB about 8.5 seconds just to evaluate a 1024-token prompt. A machine with 6 3090s (which would likely come in cheaper than the Mac Studio) is over 6 times faster at prompt evaluation.
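To put that in perspective, a quick illustrative calculation using only the figures quoted above (the longer context lengths are hypothetical, just to show how time-to-first-token scales):

    # 1024 tokens in ~8.5 s on the M2 Ultra, per [0]
    prompt_eval_tok_s = 1024 / 8.5                # ~120 tok/s
    for ctx in (1024, 4096, 8192):
        mac = ctx / prompt_eval_tok_s
        rig = mac / 6                             # the 6x3090 box from [0]
        print(f"{ctx:>5}-token prompt: ~{mac:4.0f}s on the Mac vs ~{rig:4.0f}s on 6x3090")

At longer contexts that's the difference between waiting a minute-plus before the first output token and waiting about ten seconds.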
A 120b model is likely going to be something like 1.5-2x slower at prompt evaluation, rendering it pretty much unusable (again, for me).