However, they trained their models from scratch, which is also why they only have...

imjonse · on Feb 28, 2024

They present numbers for 7B to 70B models as well.

anon373839 · on Feb 28, 2024

Those numbers are for cost only, not performance. It’s not clear whether they actually trained a 70B model or just benchmarked one with randomly initialized parameters.

sp332 · on Feb 28, 2024

They do not have perplexity numbers for the larger models (see Table 2), only speed and memory benchmarks.

imjonse · on Feb 28, 2024

You're both right. I skimmed the paper and saw the large-model numbers, but didn't notice they were only for speed. On the HF page they say those models are still being trained.

https://huggingface.co/papers/2402.17764

"We haven't finished the training of the models beyond 3B as it requires much much more resources. However, we're optimistic about the results because we have verified that BitNet follows a similar performance-parameter scaling law as the full-precision LLMs. We'll update the results on larger models once they're ready."