How we got fine-tuning Mistral-7B to not suck

isaacfrond · on Feb 8, 2024

If you look at the source [1] you can see how they solved their what are the doctors going to do problem. It is literally included in one of the prompts now :-)

Users tend to ask broad, vague questions of the document in order to test that the system is working. We want those queries to work well. For example, a user would ask "what are the doctors going to do?" of a document that is about a junior doctors' strike. Take this into account when generating the questions - in particular, refer to noun phrases by less specific descriptions, so for example instead of "junior doctors", say "doctors" in your questions.

[1]: https://github.com/helixml/helix/blob/main/api/pkg/dataprep/...

lewq · on Feb 8, 2024

That's right: the question is just does it generalize? :-D

bugglebeetle · on Feb 8, 2024

Unsloth’s colab notebooks for fine-tuning Mistral-7B are super easy to use and run fine in just about any colab instance:

https://github.com/unslothai/unsloth

It’s my default now for experimenting and basic training. If I want to get into the weeds, I use axolotl, but 9/10, it’s not really necessary.

binocarlos · on Feb 8, 2024

Excellent link thank you - I will make sure we check this out because as the README says:

> Finetune Mistral, Llama 2-5x faster with 70% less memory!

Could be very useful for us!

Disclaimer: I work on Helix

danielhanchen · on Feb 10, 2024

Great article on Helix - love the QA generation part :) If you have any questions on Unsloth, more than happy to answer them :) If you need help setting stuff up, I'm also here to help! :) ( I'm the engineer behind Unsloth :) )

danielhanchen · on Feb 10, 2024

Super thanks for sharing Unsloth! :) Highly appreciative and high praise as well :))

3abiton · on Feb 8, 2024

How much improvement do you get by finetuning?

bugglebeetle · on Feb 8, 2024

Quite a bit, as I’m generally using it for highly-specific text extraction and conversion tasks.

nl · on Feb 8, 2024

I've done fine tuning too, but the reasons they mention in "Why not just use RAG?" aren't very good.

People way understimate what RAG can do, even if in general people don't talk about the right things. For example LlamaIndex spends a lot of time talking about various extractors which is the easy part. The hard thing is deciding what you are actually searching for given a chat context.

RAG is a horrible hack (and the more you understand about the more it seems so!) but it does work.

I (and I'm sure everyone else) is experimenting with surgery on an LLM so it takes a vector representation of the docs directly alongside a text input so you don't have to do the lossy doc vector -> text -> LLM context -> vector thing. Not sure why no one has shipped this yet though!

formercoder · on Feb 8, 2024

Why is RAG a horrible hack? LLMs can draw from only 2 sources of data: their parametric knowledge and the prompt. The prompt seems like a pretty good place to put new information they need to reason with.

nl · on Feb 8, 2024

RAG is a hack for lots of reasons, but the reason I'm focused on at the moment is the pipeline.

Say you are trying to do RAG in a chat-type application. You do the following:

1) Summarize the context of chat into some text that is suitable for a search (lossy).

2) Turn this into a vector embedded in a particular vector space.

3) Use this vector to query a vector database, which returns reference to documents or document fragments (which themselves have been indexed as a lossy vector).

4) Take the text of these fragments and put them in the context of the LLM as input.

5) Modify the prompt to explain what these fragments are.

6) Then the prompt is sent to the LLM, which turns it into it's own vector representation.

An obvious improvement to this is that the VectorDB and the LLM should share an internal representation, and the VectorDB should understand this. The LLM should take this vector input as a second input alongside the text context and the LLM should combine them (in the same way you can put a text and image into a multi-modal model)

mrfox321 · on Feb 8, 2024

I guess op may be envisioning an end-to-end solution that can train a model in the context of an external document store.

I.e. One day we want to be able to backprop through the database.

Search systems face equivalent problems. The hierarchy of ML retrieval systems are separately optimized (trained). Maybe this helps regularize things, but, given enough compute / complexity, it is theoretically possible to differentiate through more of the stack.

benpacker · on Feb 8, 2024

Is that lossy?

Isn't the vector representation of the text (and the ANN index itself) lossy, and the source text itself the source of truth?

kkzz99 · on Feb 8, 2024

You use the vector embedding only for retrieval of the actual text, which then gets injected into the prompt.

nl · on Feb 8, 2024

The conversion of a conversation context to vector for retrieval is lossy and often very manual.

gdiamos · on Feb 8, 2024

Glad to see that more people outside the big ai labs are figuring out how to do fine tuning. Some open source LLM authors also seem to have figured it out.

I think many users get put off it because just pushing a button doesn’t work and the whole thing seems like a black box that you don’t know how to fix when it breaks.

It turns out that finetuning can be debugged, but the methods aren’t well documented (yet), eg by generating q/a, oversampling them, etc

When you get it to work it’s powerful - new abilities emerge beyond memorization.

Just like how llama2/claude2/gpt4 learned reasoning by memorizing sentences from Reddit posts :P

Also, I don’t get the comparison of rag vs finetuning in articles like this - why not do both. RAG is easy to setup - it’s push button. Just do it on all models (including finetuned models).

binocarlos · on Feb 8, 2024

Thanks for the feedback - I agree that fine tuning a) has potential and b) is not easy :-)

> Also, I don’t get the comparison of rag vs finetuning in articles like this - why not do both

It's interesting you say this because we are very close to adding RAG support to Helix sessions and it will be "both at the same time" not an "either or" setup. You can choose to do either or but we are interested in seeing if doing both at the same time yields better results than either or - watch this space!

disclaimer: I work on Helix

scoot · on Feb 8, 2024

> RAG is easy to setup - it’s push button

I'm interested to hear about push-button solutions for RAG that aren't a SaaS.

gdiamos · on Feb 8, 2024

https://github.com/lamini-ai/lamini-sdk/tree/main/03_RAG

You can implement RAG in 80 lines of python and 0 SaaS libraries.

It's extremely easy.

1. Load your data as a giant string (streaming)

2. Chunk it (big chunk size, small chunk steps)

3. Call an LLM to convert chunk -> embedding, store in an index (or just concat it onto a numpy array)

4. Call an LLM to convert query -> embedding

5. Compute cosine similarity between the embeddings, pick the max

6. Insert the picked chunks into the LLM prompt

That's it. I'd encourage you to try to implement it yourself.

Anything beyond this is unnecessary complexity.

I walk through the code/whiteboard of the whole thing in this video: https://www.youtube.com/watch?v=Xkzd_YNbWmc&t=6003s

joshka · on Feb 8, 2024

For helix, I notice that GitHub is listed as a data source, but there's nothing in the docs about this. I'd really love to see what a model trained on my commonly used git repos (which generally are newer than The Stack etc), and in particular their commit history. Ideally these would make it easier for code completion to have the historical context as well as the current code to play with in determining what to write next.

I often wonder how you'd go about organizing training data for a full historic github repo in a way that makes sense for training (or RAG)? The vast majority of the data is previous changes to the repo. I think this would generally mean that it would outweigh the current information and cause problems (i.e. old method names before refactoring etc.)

Also, perhaps being able to expand that out to doing the same thing for a bunch of consumers of the library that I'm maintaining would be neat.

Sprinkle in the PR and Issue history, docs website, API docs, and discord history and I think you'd have a helluva model.

binocarlos · on Feb 8, 2024

This is spot on, the thing we've not yet done is make it easy to import a repo(s) code and the associated metadata into a fine tuning session easily.

> I often wonder how you'd go about organizing training data for a full historic github repo in a way that makes sense for training (or RAG)?

This is the hard part :-) But you are right - it would be intriguing to see what the output of a fune-tuned & RAG model would look like for this use-case. We are currently experimenting with adding RAG alongside the fine tuned model (so it's both, not either or) to see if it produces better results.

I will make sure we take a look at the gihub repo use case because it feels like that would be an interesting experiment to do!

disclaimer: I work on Helix

joshka · on Feb 8, 2024

Reading through the dataprep stuff, I wonder if doing more RAG during the prep stage might help this sort of task on structured daa. E.g. pre-indexing related parts and using those to build summaries / QA pairs. I took a look at the current prompts that are very research focused ("professors" creating questions), and could extrapolate from that to a dev mindset nicely.

cuuupid · on Feb 8, 2024

Not in love with axolotl but appreciate the advantages. This is an interesting approach, but you can also finetune easily on providers who wrap axolotl like Replicate [1], Modal [2], or if you want to run the infra, LLM Engine [3].

My only gripe with Helix would be that it's smaller than the above and my org would be peeved about data security. The ability to self host is cool, but too much can go wrong too quickly with plain Docker ML. Would love to see, for example, a `cog` version of the images that we can deploy distributed with more confidence/bravado.

[1] https://replicate.com/mistralai/mistral-7b-instruct-v0.2 [2] https://modal.com [3] https://llm-engine.scale.com/

josh-sematic · on Feb 8, 2024

I’ll add https://airtrain.ai to that list of push-button fine tunes using axolotl. Disclaimer: I’m the engineer who built it.

AznHisoka · on Feb 8, 2024

Does fine tuning it on a set of docs in your “knowledge base” help for generalizing it so it can answer questions pertaining to new documents that come in (with a “similar” style/structure but with different content/fscts)?

binocarlos · on Feb 8, 2024

Fine tuning on your documents will really help to answer questions in the style and tone of those documents, so in that way, yes it helps.

It would be possible to include some parts of the new documents in the prompt so you can answer questions about new facts in the style and tone of your old documents, which we feel is useful. We are also experimenting with adding Retrieval Augmented Generation alongside fine tuning to see if the results are better than either or.

disclaimer: I work on Helix

ivanvas · on Feb 8, 2024

Is it En only or other languages would work?

_pdp_ · on Feb 8, 2024

Interesting article but, IMHO, completely impractical. Teaching the model about specific content is totally what you should not do. What you should do is to teach the model how to effectively retrieve the information even if it is unsuccessful on the first try.

binocarlos · on Feb 8, 2024

We are finding that fine tuning is very good at setting the style and tone of responses. A potential use case we are thinking about is what if your star sales person leaves the company? Could you fine tune an LLM on their conversations with customers and then do inference where it would write text in the style of your star sales person.

We are also adding function calling so the model would know to reach out to an external API to fetch some data before generating a response.

disclaimer: I work on Helix

gdiamos · on Feb 8, 2024

I really don’t get this sentiment - why not do both?

Retrieval allows looking up facts - eg in a Google search

Finetuning allows reasoning using new knowledge.

Humans do both.

swalsh · on Feb 8, 2024

I think fine tuning makes sense when you need some domain specific knowledge to properly read, analyze, and interpret the information you're passing to it. But its not an information store itself.

The most valuable skill an LLM can have is good reasoning skills and a broad enough knowledge base to understand. From there you can pass it the important bits it needs.

gdiamos · on Feb 8, 2024

I think we are saying the same thing.

The key ingredients are:

reasoning(skills) + knowledge + important bits/facts

The best systems have all of these

nicolezhu · on Feb 8, 2024

What are some os / hardware specific challenges you guys faced?

binocarlos · on Feb 8, 2024

Great question! scheduling workloads onto GPUs in a way where VRAM is being utilised efficiently was quite the challenge.

What we found was the IO latency for loading model weights into VRAM will kill responsiveness if you don't "re-use" sessions (i.e. where the model weights remain loaded and you run multiple inference sessions over the same loaded weights).

Obviously projects like https://github.com/vllm-project/vllm exist but we needed to build out a scheduler that can run a fleet of GPUs for a matrix of text/image vs inference/finetune sessions.

disclaimer: I work on Helix

ipsum2 · on Feb 8, 2024

The tl;dr seems to be: Tell a LLM to create pairs of questions and answers based off of a document, and fine-tune on that data. Does the model answer questions from the article that weren't generated in advance?

HanClinto · on Feb 8, 2024

Fantastic writeup -- thank you so much for sharing your lessons learned along the way! Very valuable resource, and I'll be checking out your product!

deforciant · on Feb 7, 2024

I always thought that fine tuning is more like getting a style rather than memorizing information word to word or at least the facts. What are the next steps to ensure that it doesn't start pulling info from the base knowledge and reference the docs instead? How long does it usually take to train? 10-15 minutes on what doc size?

lewq · on Feb 8, 2024

Fine tuning is just more training -- so it's definitely possible to teach the model facts this way too.

In practice we've found that it's a bit of a balancing act to teach the model the new knowledge without destroying existing knowledge, but it's just a matter of tuning the parameters carefully. We're also researching whether we can fine-tune a brand new expert in a MoE model like Mixtral, I've also seen work on fine-tuning just a fixed set of weights. I'm sure there will be more developments in this space soon.

In terms of how you refer to new knowledge and not base knowledge, like many things in LLMs, you just ask the LLM :-) For example, if you look at this session https://app.tryhelix.ai/session/62905598-b1b7-4d93-bc39-5a93... and click "Show Info" at the top, you can see the system prompt is:

"You are an intelligent chatbot named Helix that has been fine-tuned on document(s) e1ef2e896c in document group 62905598b1. The document group contains 1 document(s). The user will ask you questions about these documents: you must ONLY answer with context from the documents listed. Do NOT refer to background knowledge."

It does a pretty good job at this, although I'm sure there are ways to improve it further.

Referencing the specific document IDs in the fine-tuning was an innovation that has really helped us.

In terms of training time, yeah - 5 minutes on a news article, 10 minutes on a typical length paper. Pretty usable. We're experimenting with reducing the number of epochs and increasing the learning rate to make it faster at that too.

gbickford · on Feb 8, 2024

Have you tried generating two sets of qapairs, one with bad answers, and using DPO?

lewq · on Feb 8, 2024

Not yet, sounds promising!

aCoreyJ · on Feb 8, 2024

What is the advantage over using Retrieval Augmented Generation ?

mendeza · on Feb 8, 2024

RAG adds context to the users question to reduce hallucination. https://docs.llamaindex.ai/en/stable/getting_started/concept...

aCoreyJ · on Feb 8, 2024

Actually missed this was covered in the post, thanks

aCoreyJ · on Feb 8, 2024

Actually missed this is answered in the article!

drphilwinder · on Feb 8, 2024

Your sentiment is correct, but it's more of a spectrum. Fine tuning can learn facts (otherwise how would the foundation models learn facts?). But it needs those facts in the training dataset. If you have an infinite amount of facts, then you can memorise all of them.

The challenge arises when it becomes hard to generate that training data. If you just have the raw text and pop that in the context (i.e. RAG), then the LLM can be just as factual without any of that hassle.

Q2: identifiers in the prompt to say "you've been trained on this, only answer questions about this".

Q3: Depends on the size of the training data/docs. For the average PDF, about 30 minutes.

Give it a try!

gpderetta · on Feb 8, 2024

> If you have an infinite amount of facts, then you can memorise all of them

pigeon-hole?

gdiamos · on Feb 8, 2024

Not literally infinite, but Llama2 scale models can handle about 10 trillion tokens.