> if a llm will run with usable performance at that scale?
Yes.
The reason: MoE.
MoE models are able to run at a good speed because they don't load all of the weights into the GPU cores for each token.
For instance, DeepSeek R1 uses 404 GB in Q4 quantization[0], containing 256 experts of which 8 are routed to[1] (very roughly 13 GB per forward pass). With a memory bandwidth of 800 GB/s[2], the Mac Studio will be able to output at most about 800/13 ≈ 62 tokens per second.
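To make the back-of-the-envelope arithmetic explicit, here is a minimal sketch. It assumes decoding is purely memory-bandwidth bound and that the whole 404 GB is split evenly across the 256 experts, which ignores the shared expert, attention weights, and KV cache, so treat the result as a rough upper bound. The function names are made up for illustration.

```python
def active_gb_per_token(total_gb: float, n_experts: int, n_routed: int) -> float:
    """Rough GB of weights read per token, assuming weights are spread evenly over experts."""
    return total_gb * n_routed / n_experts


def max_tokens_per_second(bandwidth_gb_s: float, gb_per_token: float) -> float:
    """Upper bound on decode speed if every token must stream its active weights once."""
    return bandwidth_gb_s / gb_per_token


# Figures from the comment: 404 GB at Q4, 8 of 256 experts routed, 800 GB/s bandwidth.
gb_per_token = active_gb_per_token(total_gb=404, n_experts=256, n_routed=8)
tokens_per_s = max_tokens_per_second(bandwidth_gb_s=800, gb_per_token=gb_per_token)

print(f"{gb_per_token:.1f} GB per token, at most ~{tokens_per_s:.0f} tokens/s")
```

Without rounding up to 13 GB, this gives about 12.6 GB per token and about 63 tokens per second, the same ballpark as the figure above.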
You seem like you know what you are talking about... mind if I ask what your thoughts on quantization are? It's unclear to me if quantization affects quality... I feel like I've heard both yes and no arguments.
There is no question that quantization degrades quality. The GGUF R1 uses Q4_K_M, which, on Llama-3-8B, increases the perplexity by 0.18[0]. Many plots show increasing degradation as you quantize more[1].
That said, it is possible to train a model in a quantization-aware way[2][3], which improves the quality a bit, although not higher than the raw model.
Also, a loss in quality may not be perceptible in a specific use-case. Famously LMArena.ai tested Llama 3.1 405B with bf16 and fp8, and the latter was only 2 Elo points below, well within measurement error.
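For a sense of how small a 2-point Elo gap is, here is a quick sketch using the standard Elo expected-score formula. It only converts a rating difference into a head-to-head win probability. It says nothing about how LMArena computes its ratings or confidence intervals.

```python
def elo_expected_score(rating_diff: float) -> float:
    """Expected score of the higher-rated side under the standard Elo formula."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))


# A 2-point gap, as in the bf16 vs fp8 comparison mentioned above.
p = elo_expected_score(2)
print(f"Expected win rate with a 2-point Elo edge: {p:.2%}")
```

That works out to roughly a 50.3% expected win rate, so barely distinguishable from a coin flip.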
But if you don't already know, the question you're asking is not at all something I could distill down into a sentence or two that would make sense to a layperson. Even then, I know I couldn't distill it at all, sorry.
[0]: https://ollama.com/library/deepseek-r1
[1]: https://arxiv.org/pdf/2412.19437
[2]: https://www.apple.com/newsroom/2025/03/apple-unveils-new-mac...