Hacker News

However, they trained their models from scratch, which is also why they only have meaningful numbers for 700M, 1.3B, 3B, and 3.9B models. Apparently they are following BitNet's approach of replacing linear layers with quantized layers during training? If it were trivial to convert existing models without performance loss, I would have expected them to include a benchmark of that somewhere in the paper to generate even more impact.
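For context, the quantization the paper describes is an "absmean" scheme: each weight is scaled by the mean absolute value of the weight matrix, then rounded and clipped to {-1, 0, +1}. A minimal sketch of that idea (not the authors' code; the function name and plain-list interface are mine, and a real BitLinear layer would apply this inside the forward pass with a straight-through estimator for gradients):

```python
# Sketch of the absmean ternary quantization described in the
# BitNet b1.58 paper: scale by the mean absolute weight, then
# round and clip each entry to {-1, 0, +1}.

def absmean_quantize(weights, eps=1e-6):
    """Quantize a flat list of float weights to ternary values.

    Returns (quantized, scale); dequantize as q * scale.
    """
    # Mean absolute value of the weights; eps avoids division by zero.
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round each scaled weight to the nearest integer, clip to [-1, 1].
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


q, s = absmean_quantize([0.9, -0.02, 0.4, -1.3])
print(q)  # every entry is -1, 0, or +1
```

The appeal is that a matrix of ternary weights turns the matmul into additions and subtractions (zeros are skipped), which is where the speed and memory numbers in the paper come from.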


They present numbers for 7B to 70B models as well.


Those numbers are for cost only, not performance. It's not clear they actually trained a 70B rather than just benchmarking randomly initialized parameters.


They do not have perplexity numbers for the larger models (see Table 2), only speed and memory benchmarks.


You're both right: I skimmed the paper, saw the large-model numbers, but didn't notice they were for speed only. On the HF page they say those models are still being trained.

https://huggingface.co/papers/2402.17764

"We haven't finished the training of the models beyond 3B as it requires much much more resources. However, we're optimistic about the results because we have verified that BitNet follows a similar performance-parameter scaling law as the full-precision LLMs. We'll update the results on larger models once they're ready."




