
> if a llm will run with usable performance at that scale?

Yes.

The reason: MoE. These models run at a usable speed because they don't have to read all of the weights through the GPU cores for every token.

For instance, DeepSeek R1 uses 404 GB in Q4 quantization[0], with 256 routed experts per MoE layer of which 8 are activated per token[1] (very roughly 13 GB of weights read per forward pass). With a memory bandwidth of 800 GB/s[2], the Mac Studio should be able to output roughly 800/13 ≈ 62 tokens per second.
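To spell out the arithmetic, here's a rough sketch in Python (the 404 GB and 8-of-256 figures come from the links below; the rest is simple division):

    # Back-of-the-envelope, bandwidth-bound decode rate for an MoE model.
    total_weights_gb = 404      # DeepSeek R1 at Q4 (ollama page)
    n_experts = 256             # routed experts per MoE layer
    k_active = 8                # experts actually used per token
    bandwidth_gb_s = 800        # M3 Ultra Mac Studio memory bandwidth

    # Only the routed experts' weights need to be read for each token.
    gb_per_token = total_weights_gb * k_active / n_experts
    tok_per_s = bandwidth_gb_s / gb_per_token

    print(f"~{gb_per_token:.1f} GB/token -> ~{tok_per_s:.0f} tok/s")
    # ~12.6 GB/token -> ~63 tok/s (the 62 above comes from rounding to 13 GB)

Treat it as an upper bound: it ignores attention weights, the shared expert, and KV-cache traffic, and assumes decoding is purely bandwidth-bound.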

[0]: https://ollama.com/library/deepseek-r1

[1]: https://arxiv.org/pdf/2412.19437

[2]: https://www.apple.com/newsroom/2025/03/apple-unveils-new-mac...




This doesn’t sound correct.

You don't know which experts you'll need at each layer until you compute the router outputs, so you either keep them all loaded in memory or stream them from disk.


In RAM, yes. But to compute an activation, only the routed experts' weights have to be read from RAM into the GPU cores, and that transfer is what the memory bandwidth bounds.
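For anyone following along, here's a toy top-k MoE layer in PyTorch (hypothetical sizes, not DeepSeek's actual code) showing the routing: every expert stays resident, but only the k selected per token take part in that token's matmuls, so only their weights need to cross the memory bus.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyMoE(nn.Module):
        """Toy mixture-of-experts layer: route each token to k of n experts."""
        def __init__(self, d_model=64, n_experts=16, k=2):
            super().__init__()
            self.k = k
            self.gate = nn.Linear(d_model, n_experts, bias=False)
            self.experts = nn.ModuleList(
                nn.Linear(d_model, d_model) for _ in range(n_experts)
            )

        def forward(self, x):                      # x: (tokens, d_model)
            scores = self.gate(x)                  # (tokens, n_experts)
            weights, idx = torch.topk(scores, self.k, dim=-1)
            weights = F.softmax(weights, dim=-1)
            out = torch.zeros_like(x)
            for t in range(x.shape[0]):            # naive per-token loop
                for w, e in zip(weights[t], idx[t]):
                    # Only these k experts' weights are touched for token t.
                    out[t] += w * self.experts[e](x[t])
            return out

    moe = TinyMoE()
    y = moe(torch.randn(4, 64))    # 4 tokens, each uses 2 of 16 experts
    print(y.shape)                 # torch.Size([4, 64])

Real implementations batch tokens by expert instead of looping, but the memory-traffic argument is the same.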


Got you, yeah, I misread your comment the first time around.


Note that 404 GB < 512 GB, so the whole model fits in the Studio's unified memory.


You seem like you know what you are talking about... mind if I ask what your thoughts on quantization are? It's unclear to me whether quantization affects quality... I feel like I've heard yes and no arguments.


There is no question that quantization degrades quality. The GGUF R1 uses Q4_K_M, which, on Llama-3-8B, increases the perplexity by 0.18[0]. Many plots show increasing degradation as you quantize more[1].

That said, it is possible to train a model in a quantization-aware way[2][3], which recovers some of the quality, though not beyond that of the unquantized model.

Also, a loss in quality may not be perceptible in a specific use-case. Famously, LMArena.ai tested Llama 3.1 405B in both bf16 and fp8, and the fp8 version scored only 2 Elo points lower, well within measurement error.
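If it helps to see the mechanism, here's a toy round-trip through plain symmetric n-bit quantization (the K-quants in llama.cpp are block-wise and smarter than this, so treat it purely as an illustration of the principle):

    import numpy as np

    def quantize_roundtrip(w, bits):
        """Symmetric per-tensor quantization: float -> int grid -> float."""
        qmax = 2 ** (bits - 1) - 1            # 7 for 4-bit, 127 for 8-bit
        scale = np.abs(w).max() / qmax
        return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

    rng = np.random.default_rng(0)
    w = rng.normal(0.0, 0.02, size=(1024, 1024)).astype(np.float32)

    for bits in (8, 6, 4):
        err = np.abs(quantize_roundtrip(w, bits) - w).mean()
        print(f"{bits}-bit: mean abs weight error {err:.2e}")
    # The error grows roughly 4x for every 2 bits you drop.

Whether that weight error shows up as 0.18 perplexity or 2 Elo points depends on the model and the task, which is why the "yes and no arguments" you've heard can both be right.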

[0]: https://github.com/ggml-org/llama.cpp/blob/master/examples/q...

[1]: https://github.com/ggml-org/llama.cpp/discussions/5063#discu...

[2]: https://pytorch.org/blog/quantization-aware-training/

[3]: https://mistral.ai/news/ministraux


I don't know what I'm talking about, but when I first asked the question you're asking, this https://gist.github.com/Artefact2/b5f810600771265fc1e3944228... helped start me down a path to understanding. I think.

But if you don't already know, the question you're asking is not something I could distill down into a sentence or two that would make sense to a layperson. Even then, I know I couldn't distill it at all, sorry.

Edit: I found the link I referenced above on quantized models, by bartowski on Hugging Face: https://huggingface.co/bartowski/Qwen2.5-Coder-14B-GGUF#whic...


I did my own experiments, and it looks like (surprisingly) Q4_K_M models often outperform Q6 and Q8 quantized models.

For bigger models (in the 8B-70B range), Q4_K_M is very good; I saw no degradation compared to the full FP16 models.



