
Can you explain for a noob why?


Easier to train, easier to experiment with. Most research and prototyping happens on the scale that is just barely out of the "toy" category.


Quantization means reducing the number of bits used to encode each floating point number that makes up a parameter in the model. So instead of having billions of possible values per weight, you might have just 256 (at 8 bits). The model has to have its weights crammed into a much smaller number of possible values, which reduces its ability to produce good outputs.
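
To make that concrete, here is a minimal sketch, assuming plain uniform min-max quantization over a handful of made-up weights. Real schemes (e.g. the k-quants in llama.cpp) work block-wise and are more elaborate; the function names and sample values below are only illustrative.

```python
def quantize(weights, bits=8):
    """Map floats to integer codes in [0, 2**bits - 1] using min-max scaling."""
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    # Width of one quantization step; fall back to 1.0 if all weights are equal.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate floats from the integer codes."""
    return [c * scale + lo for c in codes]


# Made-up example weights, quantized to 4 bits (16 possible values).
weights = [-0.82, -0.11, 0.0, 0.07, 0.35, 0.91]
codes, scale, lo = quantize(weights, bits=4)
restored = dequantize(codes, scale, lo)

# Rounding error per weight is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes)
print(max_error, "<=", scale / 2)
```

Fewer bits means a larger step between representable values, so the restored weights drift further from the originals. That drift is the quality loss described above.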


Sorry, my question is, why are the 7B models so exciting?


They don't require really expensive and power-hungry components to run, i.e. a mid-range GPU can run a (4- or 5-bit quantized) 7B model at 50+ tokens/second, so it's completely feasible to run on a small budget. They are easier to fine-tune, because they are smaller, and you can even just do CPU inference if you really want. There are good OSS implementations like llama.cpp and exllama. And there is a lot of belief that 7B models are not yet tapped out in terms of efficacy, so they will keep improving.


A quantised 7B model is also about the biggest you can run on an M1 MacBook. It's nowhere near that speed, but it does work.


To add some numbers to sibling's comment, if a parameter is originally fp16 (a half-precision float, I think this is what LLaMA was trained in) you need 16bit*7*10^9 ~= 13GiB of RAM to fit a whole 7B model in memory. Current high-end consumer GPUs (4090) top out at 24GB, so these small models fit in GPUs you can have at home.

For comparison, the next largest size is usually 13B which at fp16 already takes ~24GiB (some of which you'll be using for your regular applications like your browser, the OS, etc.)
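
The arithmetic above can be checked with a quick sketch. It counts the weights only and assumes every parameter uses the same number of bits; activations, the KV cache and framework overhead come on top. The helper name `weights_gib` is just something made up for this example.

```python
GIB = 2 ** 30  # bytes per GiB


def weights_gib(n_params, bits_per_param):
    """Memory needed to hold the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / GIB


# Parameter counts from the thread; 16 = fp16, 8 and 4 = quantized widths.
for n_params in (7e9, 13e9):
    for bits in (16, 8, 4):
        size = weights_gib(n_params, bits)
        print(f"{n_params / 1e9:.0f}B @ {bits:>2}-bit: {size:5.1f} GiB")
```

At fp16 this gives roughly 13 GiB for 7B and 24 GiB for 13B, matching the figures above. At 4 bits, 7B drops to a bit over 3 GiB.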

7B is also faster, since there is less computation on the critical path for each token.

Training requires even more RAM (and the more RAM you have the faster you can train).

You could quantize 13B to make it fit in consumer cards without large losses (see e.g. charts for k-quants LLaMA inference[0]), but quantization hurts training more than it hurts inference (couldn't find charts here, I'm on mobile). This also means you could quantize 7B models to run them on even less powerful hardware, like low-end consumer GPUs or eventually even mobile phones (which are also power-sensitive due to running on batteries).

[0] https://github.com/ggerganov/llama.cpp/pull/1684



