
Note that I was only commenting on modern quantized LLMs, which largely avoid formats like FP16 or INT8, preferring lower precision wherever feasible. When in-memory model values must be padded back up to FP16/INT8, this slashes your effective use of memory bandwidth, which is what determines token generation speed. So the only realistic benefit is in the prompt pre-processing phase, and even then only in lower power use compared to a GPU, not in higher speed.
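
To make that concrete, here's a rough back-of-the-envelope sketch (the 100 GB/s bandwidth figure and the 27B parameter count are just assumed for illustration): token generation is roughly bandwidth-bound because every weight has to be streamed from memory once per token, so padding 4-bit weights out to 8 or 16 bits cuts the achievable tokens/sec proportionally.

    # Rough sketch with assumed numbers: token generation is approximately
    # memory-bandwidth bound, since all weights are read once per token.
    def tokens_per_sec(params_billion, bits_per_weight, bandwidth_gb_s):
        weight_bytes = params_billion * 1e9 * bits_per_weight / 8
        return bandwidth_gb_s * 1e9 / weight_bytes

    bandwidth = 100  # GB/s -- hypothetical figure for an NPU/SoC memory bus
    for bits in (4, 8, 16):
        rate = tokens_per_sec(27, bits, bandwidth)
        print(f"{bits:>2}-bit 27B model: ~{rate:.1f} tok/s")

Same weights, same bus: storing them at 16-bit instead of 4-bit cuts throughput by roughly 4x.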



That's really interesting! I didn't know about that padding behavior. Which models would this include? I know Gemma 3 is released in bf16 - are you just talking about the quantized versions of these, or are models being released purely as quantized versions these days? I know Google just released a QAT (Quantization Aware Training) version of Gemma 3 27B, but that base model was already released.


Models may be released unquantized (and even then, releases are gradually shifting towards lower precisions over time), but most people will run them quantized simply because that gives the best bang for your buck: you can fit more interesting models on the same hardware. Of course, this is strictly about local LLM inference, though one may reasonably assume the big players are doing something similar.
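
For a sense of the bang-for-buck point, a minimal sketch (hypothetical 16 GB RAM budget, weights only, ignoring KV cache and runtime overhead):

    # Minimal sketch with assumed numbers: how many parameters fit in a fixed
    # RAM budget at different weight precisions (weights only, no KV cache).
    ram_gb = 16  # hypothetical local machine
    for bits in (16, 8, 4):
        max_params_b = ram_gb * 8 / bits  # billions of parameters
        print(f"{bits:>2}-bit weights: ~{max_params_b:.0f}B params fit in {ram_gb} GB")

So the same 16 GB that holds an 8B model at 16-bit can hold a ~32B model at 4-bit, which is why local users almost always reach for the quantized builds.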




