They didn't increase the memory bandwidth. You get the same memory bandwidth that's available on the M2 Studio. Yes, yes, of course you can get 512 GB of unified RAM for 10 grand.
The question is whether an LLM will run with usable performance at that scale. The point is that there are diminishing returns: even with enough unified RAM and the increased processing speed of the new M3 chip for AI, the memory bandwidth stays the same.
> whether an LLM will run with usable performance at that scale
Yes.
The reason: MoE.
They are able to run at a good speed because they don't need to read all of the weights for each token: only the routed experts' weights go through the GPU cores.
For instance, DeepSeek R1 uses 404 GB in Q4 quantization[0] and contains 256 experts, of which 8 are routed to per token[1] (very roughly 13 GB read per forward pass). With a memory bandwidth of 800 GB/s[3], the Mac Studio should be able to output about 800/13 ≈ 62 tokens per second.
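A back-of-the-envelope sketch of that estimate in Python (assuming decoding is purely memory-bandwidth-bound, and ignoring the always-active shared parameters, attention layers, and KV cache, which all push the real number lower):

    # Rough decode throughput for a MoE model on unified memory.
    # Assumes each generated token only reads the routed experts' weights.
    total_weights_gb = 404    # DeepSeek R1 at Q4 quantization
    experts_total = 256
    experts_active = 8        # experts routed to per token
    bandwidth_gbps = 800      # M3 Ultra memory bandwidth, GB/s

    gb_per_token = total_weights_gb * experts_active / experts_total
    tokens_per_second = bandwidth_gbps / gb_per_token
    print(f"~{gb_per_token:.1f} GB per token -> ~{tokens_per_second:.0f} tok/s")

That prints ~12.6 GB per token and ~63 tok/s, the same ballpark as the 62 above.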
You seem like you know what you are talking about... mind if I ask what your thoughts on quantization are? It's unclear to me whether quantization affects quality... I feel like I've heard arguments both ways.
There is no question that quantization degrades quality. The GGUF R1 uses Q4_K_M, which, on Llama-3-8B, increases the perplexity by 0.18[0]. Many plots show increasing degradation as you quantize more[1].
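A minimal sketch of where the loss comes from, using plain round-to-nearest quantization (illustrative only; Q4_K_M actually uses blockwise scales and offsets, which reduce this error):

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(size=4096).astype(np.float32)  # stand-in for a weight row

    def quantize_rtn(x, bits):
        # Symmetric round-to-nearest onto 2**(bits-1)-1 integer levels.
        levels = 2 ** (bits - 1) - 1
        scale = np.abs(x).max() / levels
        return np.round(x / scale) * scale

    for bits in (8, 4, 2):
        err = np.abs(w - quantize_rtn(w, bits)).mean()
        print(f"{bits}-bit mean abs weight error: {err:.4f}")

Fewer bits means coarser rounding of every weight, and those small errors accumulate into the perplexity differences above.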
That said, it is possible to train a model in a quantization-aware way[2][3], which recovers some of the quality, although not up to the level of the raw model.
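The usual core trick there is the straight-through estimator: quantize on the forward pass but let gradients flow through the rounding as if it were the identity, so the model learns weights that survive quantization. A sketch (illustrative, not any specific paper's recipe):

    import torch

    def fake_quant(w, bits=4):
        # Forward value is the quantized weight; the detach trick makes
        # the backward gradient an identity (straight-through estimator).
        levels = 2 ** (bits - 1) - 1
        scale = w.abs().max() / levels
        q = torch.round(w / scale) * scale
        return w + (q - w).detach()

    w = torch.randn(16, requires_grad=True)
    loss = fake_quant(w).pow(2).sum()
    loss.backward()  # gradients reach w despite the non-differentiable round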
Also, a loss in quality may not be perceptible in a specific use case. Famously, LMArena.ai tested Llama 3.1 405B in bf16 and fp8, and the latter was only 2 Elo points lower, well within measurement error.
But if you don't already know, the question you're asking is not at all something I could distill into a sentence or two that would make sense to a layperson. Even then, I know I couldn't distill it at all, sorry.
I returned an M2 Max Studio with 96 GB RAM: unquantized Llama 3.1 70B was dog slow, not an interactive pace. I'm interested in offline LLMs but couldn't see how it was going to produce $3,000 of ROI.
It would be really cool if there was an "are we there yet" website for reasonable offline AI.
It could track different hardware configurations and reasonably standardized benchmark performance per model. I know there are benchmarks buried in the GitHub Llama repository.
There seems to be a LOT of interest in such a site in the comments here. There also seem to be multiple IP issues with sharing your code repo with an online service, so I feel a lot of folks are waiting for the hardware to make this possible.
We need a SWE-bench for open-source LLMs, and each model should have 3DMark-like benchmarks on various hardware setups.
I get why he calls it a simulator, since it can simulate token output. That's useful for evaluating a use case when you need a sense of how the token output feels, beyond a simple tokens-per-second figure.
Yeah, I don’t think RAM is the bottleneck. Which is unfortunate. It feels like a missed opportunity for them. I think Apple partly became popular because it enabled creatives and developers.