They didn't increase the memory bandwidth. You get the same memory bandwidth that's available on the M2 Studio. Yes, yes, of course you can get 512 GB of unified RAM for 10 grand.
The question is whether an LLM will run with usable performance at that scale. The point is that there are diminishing returns: even with enough unified RAM and the increased processing speed of the new M3 chip for AI, the memory bandwidth stays the same.
> whether an LLM will run with usable performance at that scale
Yes.
The reason: MoE.
They are able to run at a good speed because they don't need to read all of the weights for each token: only the routed experts' weights go through the GPU cores.
For instance, DeepSeek R1 uses 404 GB in Q4 quantization[0] and contains 256 experts, of which 8 are routed to per token[1] (very roughly 13 GB read per forward pass). With a memory bandwidth of 800 GB/s[3], the Mac Studio should be able to output about 800/13 ≈ 62 tokens per second.
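A back-of-the-envelope sketch of that estimate in Python (assuming decoding is purely memory-bandwidth-bound, and ignoring the always-active shared parameters, attention layers, and KV cache, which all push the real number lower):

    # Rough decode throughput for a MoE model on unified memory.
    # Assumes each generated token only reads the routed experts' weights.
    total_weights_gb = 404    # DeepSeek R1 at Q4 quantization
    experts_total = 256
    experts_active = 8        # experts routed to per token
    bandwidth_gbps = 800      # M3 Ultra memory bandwidth, GB/s

    gb_per_token = total_weights_gb * experts_active / experts_total
    tokens_per_second = bandwidth_gbps / gb_per_token
    print(f"~{gb_per_token:.1f} GB per token -> ~{tokens_per_second:.0f} tok/s")

That prints ~12.6 GB per token and ~63 tok/s, the same ballpark as the 62 above.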
You seem like you know what you are talking about... mind if I ask what your thoughts on quantization are? It's unclear to me whether quantization affects quality... I feel like I've heard arguments both ways.
There is no question that quantization degrades quality. The GGUF R1 uses Q4_K_M, which, on Llama-3-8B, increases the perplexity by 0.18[0]. Many plots show increasing degradation as you quantize more[1].
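A minimal sketch of where the loss comes from, using plain round-to-nearest quantization (illustrative only; Q4_K_M actually uses blockwise scales and offsets, which reduce this error):

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(size=4096).astype(np.float32)  # stand-in for a weight row

    def quantize_rtn(x, bits):
        # Symmetric round-to-nearest onto 2**(bits-1)-1 integer levels.
        levels = 2 ** (bits - 1) - 1
        scale = np.abs(x).max() / levels
        return np.round(x / scale) * scale

    for bits in (8, 4, 2):
        err = np.abs(w - quantize_rtn(w, bits)).mean()
        print(f"{bits}-bit mean abs weight error: {err:.4f}")

Fewer bits means coarser rounding of every weight, and those small errors accumulate into the perplexity differences above.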
That said, it is possible to train a model in a quantization-aware way[2][3], which recovers some of the quality, although not up to the level of the raw model.
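The usual core trick there is the straight-through estimator: quantize on the forward pass but let gradients flow through the rounding as if it were the identity, so the model learns weights that survive quantization. A sketch (illustrative, not any specific paper's recipe):

    import torch

    def fake_quant(w, bits=4):
        # Forward value is the quantized weight; the detach trick makes
        # the backward gradient an identity (straight-through estimator).
        levels = 2 ** (bits - 1) - 1
        scale = w.abs().max() / levels
        q = torch.round(w / scale) * scale
        return w + (q - w).detach()

    w = torch.randn(16, requires_grad=True)
    loss = fake_quant(w).pow(2).sum()
    loss.backward()  # gradients reach w despite the non-differentiable round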
Also, a loss in quality may not be perceptible in a specific use case. Famously, LMArena.ai tested Llama 3.1 405B in bf16 and fp8, and the latter was only 2 Elo points lower, well within measurement error.
But if you don't already know, the question you're asking is not at all something I could distill into a sentence or two that would make sense to a layperson. Even then, I know I couldn't distill it at all, sorry.
I returned an M2 Max Studio with 96 GB RAM: unquantized Llama 3.1 70B was dog slow, not an interactive pace. I'm interested in offline LLMs but couldn't see how it was going to produce $3,000 of ROI.
It would be really cool if there was an "are we there yet" website for reasonable offline AI.
It could track different hardware configurations and reasonably standardized benchmark performance per model. I know there are benchmarks buried in the GitHub Llama repository.
There seems to be a LOT of interest in such a site in the comments here. There also seem to be multiple IP issues with sharing your code repo with an online service, so I feel a lot of folks are waiting for the hardware to make this possible.
We need a SWE-bench for open-source LLMs, and each model should have 3DMark-like benchmarks on various hardware setups.
I get why he calls it a simulator, since it can simulate token output. That's useful for evaluating a use case when you need a sense of how the token output feels, beyond a simple tokens-per-second figure.
Yeah, I don’t think RAM is the bottleneck. Which is unfortunate. It feels like a missed opportunity for them. I think Apple partly became popular because it enabled creatives and developers.