
With existing common quantization techniques, a 70b model quantized to 3-bit still drastically outperforms an unquantized 35b model.


Are you sure? I was under the impression that 3-bit quantization still results in significant degradation. Which quantization method are you talking about?


It does result in significant degradation relative to an unquantized model of the same size, but even with simple llama.cpp K-quantization, it's still worth it all the way down to 2-bit. The chart in this llama.cpp PR speaks for itself:

https://github.com/ggerganov/llama.cpp/pull/1684#issue-17396...


Oh wow, you’re right. Though it seems they are using very small weight group sizes: either 16 or 32 (one fp16 scaling factor per group). In this paper there seems to be no weight grouping, so it’s a bit apples to oranges. See the rough sketch below of what per-group scaling means.
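
To make the grouping point concrete, here's a rough Python sketch of plain symmetric absmax group-wise quantization. Function names and the exact scheme are my own illustration, not llama.cpp's actual K-quant layout (which packs superblocks with extra per-block minimums), but it shows where the per-group fp16 scale comes in:

    import numpy as np

    def quantize_groupwise(w, bits=3, group_size=32):
        # Symmetric absmax quantization: each group of `group_size` weights
        # shares one fp16 scale, adding 16/group_size bits/weight of overhead.
        qmax = 2 ** (bits - 1) - 1                  # 3 for 3-bit
        groups = w.reshape(-1, group_size)
        scales = (np.abs(groups).max(axis=1, keepdims=True) / qmax).astype(np.float16)
        q = np.clip(np.round(groups / scales), -qmax, qmax).astype(np.int8)
        return q, scales

    def dequantize_groupwise(q, scales):
        # Reconstruct approximate fp32 weights from quantized values and scales.
        return (q.astype(np.float32) * scales.astype(np.float32)).ravel()

    w = np.random.randn(1024).astype(np.float32)
    q, s = quantize_groupwise(w)
    print("mean abs error:", np.abs(w - dequantize_groupwise(q, s)).mean())

With group_size=32 at 3 bits that's 3 + 16/32 = 3.5 effective bits per weight, which is exactly the overhead an ungrouped scheme doesn't pay, hence the apples-to-oranges comparison.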



