> BitNet b1.58 can match the performance of the full precision baseline starting from a 3B size. ... This demonstrates that BitNet b1.58 is a Pareto improvement over the state-of-the-art LLM models.
> BitNet b1.58 is enabling a new scaling law with respect to model performance and inference cost. As a reference, we can have the following equivalence between different model sizes in 1.58-bit and 16-bit based on the results in Figure 2 and 3.
> • 13B BitNet b1.58 is more efficient, in terms of latency, memory usage and energy consumption, than 3B FP16 LLM.
> • 30B BitNet b1.58 is more efficient, in terms of latency, memory usage and energy consumption, than 7B FP16 LLM.
> • 70B BitNet b1.58 is more efficient, in terms of latency, memory usage and energy consumption, than 13B FP16 LLM.
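For a rough sense of the memory side of those equivalences, here's a back-of-the-envelope weight-storage estimate (weights only, assuming 1.58 bits vs. 16 bits per parameter and ignoring activations and the KV cache; the paper's figures are measured latency/memory/energy, this is just the arithmetic behind the gap):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (weights only, no KV cache or activations)."""
    bits = params_billion * 1e9 * bits_per_weight
    return bits / 8 / 1e9

# 1.58-bit ternary weights vs. FP16 weights
print(f"70B @ 1.58-bit: ~{weight_memory_gb(70, 1.58):.1f} GB")  # ~13.8 GB
print(f"13B @ 16-bit:   ~{weight_memory_gb(13, 16):.1f} GB")    # ~26.0 GB
print(f"30B @ 1.58-bit: ~{weight_memory_gb(30, 1.58):.1f} GB")  # ~5.9 GB
print(f"7B  @ 16-bit:   ~{weight_memory_gb(7, 16):.1f} GB")     # ~14.0 GB
```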
This paper seems to represent a monumental breakthrough in LLM efficiency, as the efficiency gains come with zero (or negative) performance penalty.
Does it seem at all likely that existing models could be converted?
It’s a pity if realizing these gains absolutely requires full pre-training from scratch. I imagine more than a few people will at least try to find a way to repurpose the knowledge contained in existing models.
You can also have another model "mentor" the new model you're training to speed things up, so you don't have to start from scratch with zero knowledge. This is done a lot via a technique called knowledge distillation.
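For anyone unfamiliar, a minimal sketch of that idea in PyTorch (the `student`/`teacher` logits are placeholders, and this is the generic recipe, not anything from the BitNet paper): the student is trained against the teacher's temperature-softened output distribution in addition to the ground-truth labels.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with KL divergence to the teacher's
    softened distribution (standard knowledge distillation)."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),  # student log-probs at temperature T
        F.softmax(teacher_logits / T, dim=-1),      # teacher probs at temperature T
        reduction="batchmean",
    ) * (T * T)                                     # rescale gradients per Hinton et al.
    return alpha * hard + (1 - alpha) * soft
```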
However, they trained their models from scratch, which is also why they only have meaningful numbers for the 700M, 1.3B, 3B, and 3.9B models. Apparently they are following BitNet's approach of replacing linear layers with quantized layers during training (see the sketch below)? If it were trivial to convert existing models without performance loss, I would have expected them to include a benchmark of that somewhere in the paper to generate even more impact.
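Roughly what "replacing linear layers with quantized layers during training" looks like, as a sketch of my reading of the b1.58 recipe (not their actual code): keep latent full-precision weights, quantize them to {-1, 0, +1} with an absmean scale on the forward pass, and use a straight-through estimator so gradients still flow to the latent weights. The paper also quantizes activations to 8 bits, which is omitted here.

```python
import torch
import torch.nn as nn

class BitLinear158(nn.Linear):
    """Sketch of a ternary-weight linear layer (BitNet b1.58-style).
    Latent weights stay in full precision; the forward pass uses
    quantized weights with a straight-through estimator."""

    def forward(self, x):
        w = self.weight
        scale = w.abs().mean().clamp(min=1e-5)           # absmean scaling factor
        w_q = (w / scale).round().clamp(-1, 1) * scale   # ternary {-1, 0, +1} * scale
        w_ste = w + (w_q - w).detach()                   # straight-through estimator
        return nn.functional.linear(x, w_ste, self.bias)

# e.g. swap nn.Linear for BitLinear158 in the transformer's projection layers
layer = BitLinear158(1024, 4096, bias=False)
y = layer(torch.randn(2, 1024))
```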
You're both right; I skimmed the paper, saw the large model numbers, but didn't notice they were for speed. On the HF page they say those models are still being trained:
"We haven't finished the training of the models beyond 3B as it requires much much more resources. However, we're optimistic about the results because we have verified that BitNet follows a similar performance-parameter scaling law as the full-precision LLMs. We'll update the results on larger models once they're ready."
Yes. I wonder how long it will be before someone who does have a lot of compute, like OpenAI/MS or others, rapidly pivots and tries this out on even larger models.
Doesn't this mean that the current big players can rapidly expand by huge multiples in size?
I wonder if 1-bit quantization is the main reason why pplx.ai is faster than any other RAG or chatbot. Gemini, in comparison, is a turtle, though it is better at explanations, while pplx is concise.
Nope. The model on Perplexity is a fine-tuned GPT-3.5 (the free one).
As for the paid versions, you can choose between GPT-4 (not Turbo), Gemini Pro, Claude, etc.
You can also choose their own model ("Experimental"), but it is not faster than the others.
All of these proprietary models are fast on Perplexity. I'd guess they are using some insane caching system, better API infrastructure...
Absolutely not; 1-bit isn't even real yet. Perplexity does a ton of precaching. TL;DR: every novel query is an opportunity to cache: each web page response, that response turned into embeddings, and the LLM response. That's also why I hate it; it's just a rushed version of RAG with roughly the same privacy guarantees any incumbent would have given you in the last 15 years (read: none, and they'll gleefully exploit yours while saying "whoops!").
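A hedged sketch of the kind of precaching being described, purely illustrative and not Perplexity's actual stack (`embed` and `generate` are hypothetical stand-ins for real embedding-model and LLM calls): hash the query and each fetched page, cache embeddings and the final response, and only hit the expensive paths on a miss.

```python
import hashlib

# Illustrative in-memory caches; a real system would use a persistent store.
page_embedding_cache: dict[str, list[float]] = {}
response_cache: dict[str, str] = {}

def cache_key(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def answer(query: str, pages: list[str], embed, generate) -> str:
    qkey = cache_key(query)
    if qkey in response_cache:                      # whole-answer cache hit
        return response_cache[qkey]
    doc_vectors = []
    for page in pages:                              # embed each page at most once
        pkey = cache_key(page)
        if pkey not in page_embedding_cache:
            page_embedding_cache[pkey] = embed(page)
        doc_vectors.append(page_embedding_cache[pkey])
    response = generate(query, doc_vectors)         # only novel queries reach the LLM
    response_cache[qkey] = response
    return response
```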