> BitNet b1.58 can match the performance of the full precision baseline starting from a 3B size. ... This demonstrates that BitNet b1.58 is a Pareto improvement over the state-of-the-art LLM models.
> BitNet b1.58 is enabling a new scaling law with respect to model performance and inference cost. As a reference, we can have the following equivalence between different model sizes in 1.58-bit and 16-bit based on the results in Figure 2 and 3.
> • 13B BitNet b1.58 is more efficient, in terms of latency, memory usage and energy consumption, than 3B FP16 LLM.
> • 30B BitNet b1.58 is more efficient, in terms of latency, memory usage and energy consumption, than 7B FP16 LLM.
> • 70B BitNet b1.58 is more efficient, in terms of latency, memory usage and energy consumption, than 13B FP16 LLM.
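For a rough sense of the memory side of those equivalences, here's a back-of-the-envelope weight-storage estimate (weights only, assuming 1.58 bits vs. 16 bits per parameter and ignoring activations and the KV cache; the paper's figures are measured latency/memory/energy, this is just the arithmetic behind the gap):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (weights only, no KV cache or activations)."""
    bits = params_billion * 1e9 * bits_per_weight
    return bits / 8 / 1e9

# 1.58-bit ternary weights vs. FP16 weights
print(f"70B @ 1.58-bit: ~{weight_memory_gb(70, 1.58):.1f} GB")  # ~13.8 GB
print(f"13B @ 16-bit:   ~{weight_memory_gb(13, 16):.1f} GB")    # ~26.0 GB
print(f"30B @ 1.58-bit: ~{weight_memory_gb(30, 1.58):.1f} GB")  # ~5.9 GB
print(f"7B  @ 16-bit:   ~{weight_memory_gb(7, 16):.1f} GB")     # ~14.0 GB
```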
This paper seems to represent a monumental breakthrough in LLM efficiency, as the efficiency gains come with zero (or negative) performance penalty.
Does it seem at all likely that existing models could be converted?
It’s a pity if realizing these gains absolutely requires full pre-training from scratch. I imagine more than a few people will at least try to find a way to repurpose the knowledge contained in existing models.
You can also have another model "mentor" the new model you're training to speed things up, so you don't have to start from scratch with zero knowledge. This is done a lot via a technique called knowledge distillation.
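For anyone unfamiliar, a minimal sketch of that idea in PyTorch (the `student`/`teacher` logits are placeholders, and this is the generic recipe, not anything from the BitNet paper): the student is trained against the teacher's temperature-softened output distribution in addition to the ground-truth labels.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with KL divergence to the teacher's
    softened distribution (standard knowledge distillation)."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),  # student log-probs at temperature T
        F.softmax(teacher_logits / T, dim=-1),      # teacher probs at temperature T
        reduction="batchmean",
    ) * (T * T)                                     # rescale gradients per Hinton et al.
    return alpha * hard + (1 - alpha) * soft
```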
However, they trained their models from scratch, which is also why they only have meaningful numbers for the 700M, 1.3B, 3B, and 3.9B models. Apparently they are following BitNet's approach of replacing linear layers with quantized layers during training (see the sketch below)? If it were trivial to convert existing models without performance loss, I would have expected them to include a benchmark of that somewhere in the paper to generate even more impact.
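Roughly what "replacing linear layers with quantized layers during training" looks like, as a sketch of my reading of the b1.58 recipe (not their actual code): keep latent full-precision weights, quantize them to {-1, 0, +1} with an absmean scale on the forward pass, and use a straight-through estimator so gradients still flow to the latent weights. The paper also quantizes activations to 8 bits, which is omitted here.

```python
import torch
import torch.nn as nn

class BitLinear158(nn.Linear):
    """Sketch of a ternary-weight linear layer (BitNet b1.58-style).
    Latent weights stay in full precision; the forward pass uses
    quantized weights with a straight-through estimator."""

    def forward(self, x):
        w = self.weight
        scale = w.abs().mean().clamp(min=1e-5)           # absmean scaling factor
        w_q = (w / scale).round().clamp(-1, 1) * scale   # ternary {-1, 0, +1} * scale
        w_ste = w + (w_q - w).detach()                   # straight-through estimator
        return nn.functional.linear(x, w_ste, self.bias)

# e.g. swap nn.Linear for BitLinear158 in the transformer's projection layers
layer = BitLinear158(1024, 4096, bias=False)
y = layer(torch.randn(2, 1024))
```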
You're both right; I skimmed the paper, saw the large model numbers, but didn't notice they were for speed. On the HF page they say those models are still being trained:
"We haven't finished the training of the models beyond 3B as it requires much much more resources. However, we're optimistic about the results because we have verified that BitNet follows a similar performance-parameter scaling law as the full-precision LLMs. We'll update the results on larger models once they're ready."
Yes. I wonder how long it will be before someone who does have a lot of compute, like OpenAI/MS or others, rapidly pivots and tries this out on even larger models.
Doesn't this mean that the current big players can rapidly expand by huge multiples in size?
I wonder if 1-bit quantization is the main reason why pplx.ai is faster than any other RAG or chatbot. Gemini, in comparison, is a turtle, though it is better at explanations, while pplx is concise.
Nope. The model on Perplexity is a fine-tuned GPT-3.5 (the free one).
As for the paid versions, you can choose between GPT-4 (not Turbo), Gemini Pro, Claude, etc.
You can also choose their own model ("Experimental"), but it is not faster than the others.
All of these proprietary models are fast on Perplexity. I'd guess they are using some insane caching system, better API infrastructure...
Absolutely not; 1-bit isn't even real yet. Perplexity does a ton of precaching. TL;DR: every novel query is an opportunity to cache: each web page response, that response turned into embeddings, and the LLM response. That's also why I hate it; it's just a rushed version of RAG with roughly the same privacy guarantees any incumbent would have given you in the last 15 years (read: none, and they'll gleefully exploit yours while saying "whoops!").
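A hedged sketch of the kind of precaching being described, purely illustrative and not Perplexity's actual stack (`embed` and `generate` are hypothetical stand-ins for real embedding-model and LLM calls): hash the query and each fetched page, cache embeddings and the final response, and only hit the expensive paths on a miss.

```python
import hashlib

# Illustrative in-memory caches; a real system would use a persistent store.
page_embedding_cache: dict[str, list[float]] = {}
response_cache: dict[str, str] = {}

def cache_key(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def answer(query: str, pages: list[str], embed, generate) -> str:
    qkey = cache_key(query)
    if qkey in response_cache:                      # whole-answer cache hit
        return response_cache[qkey]
    doc_vectors = []
    for page in pages:                              # embed each page at most once
        pkey = cache_key(page)
        if pkey not in page_embedding_cache:
            page_embedding_cache[pkey] = embed(page)
        doc_vectors.append(page_embedding_cache[pkey])
    response = generate(query, doc_vectors)         # only novel queries reach the LLM
    response_cache[qkey] = response
    return response
```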