Hacker News
DeepSpeed Chat: Easy, fast and affordable RLHF training of ChatGPT-like models (github.com/microsoft)
240 points by quantisan on April 12, 2023 | 55 comments


Microsoft: invests 10 billion in a company. Also Microsoft: here are the tools you need to DIY, for free, one of the premium features of the company we just invested 10 billion in.

Not that reproducing GPT-4 is going to be easy with this, but it'll definitely remove some major hurdles. I read a report about the difficulties HuggingFace had producing their Bloom model, and a lot of it was the sort of straightforward systems engineering that goes into tooling like this.

Is the Bloom model considered a failure by the community? The introduction says it was supposed to include improvements over GPT-3, yet it performs much worse, I'd guess because of lower-quality training data. I wonder what sort of company would have data of high enough quality that they could use this project to fine-tune a public model to the point where it beats plain old GPT-4 in some scenario, especially when you can just inject extra info into the GPT-4 prompt, as phind does. What even is the use of fine-tuning, given that GPT-4 exists?


> Microsoft: invests 10 billion in company. Also Microsoft: here's the tools you need to DIY one of the premium features the company we just invested 10 billion in for free.

In my mind, MSFT spent that money to acquire a head start on getting LLM-style capabilities into MS’s profitable product portfolio. This is money well spent:

1. MSFT can and will make money on these capabilities.

2. If MSFT didn’t do this, they would take the substantial risk of someone else pulling it off and attacking their moat.

I can’t really imagine today’s Google pulling this off with Google Docs. Adobe doesn’t target MS’s market directly enough to be an immediate risk. Apple doesn’t seem interested in competing with MS. Meta is doing its own thing in the corner. But someone really could attack MS with something amazing and make short-term sales, which could turn into a long-term loss for MS. (Salesforce? They don’t seem able to make things that normal people want to use.) But MS is now ahead of the curve, and they didn’t really spend that much money to get there.

Keep in mind that LibreOffice is not vastly less capable than Office 365, and Teams is not exactly an insurmountably strong piece of technology.


> In my mind, MSFT spent that money to acquire a head start on getting LLM-style capabilities into MS’s profitable product portfolio. This is money well spent:

Personally, I think in 10 years people will joke about machine-generated boilerplate the same way they joke about Clippy today.


Also, they will definitely be making some upside on the OpenAI stock…


My point is that MS’s investment in OpenAI may be a good deal for MS regardless of what happens to the valuation of OpenAI-the-company.

The LLM space is moving fast. OpenAI may stay on top for a long time, or it may not. But I expect Microsoft’s use of LLMs to be valuable for MS and likely market-leading in the office AI space for quite some time regardless of what happens to OpenAI.


Look at the HuggingFace codebase and you'll understand why Bloom is so subpar. Shame that funding money wasn't given to the Eleuther AI team instead.


Please elaborate? I’m not familiar with this.


can you elaborate or source?


Microsoft: invests 10 billion in company. Also Microsoft: here's the tools you need to DIY one of the premium features the company we just invested 10 billion in for free.

The idea is to get the people who aren't willing to pay hooked on what they offer. Once you are used to a system, you will probably want the same thing at your workplace, where they can charge a premium. The same thing was done with Windows in Asia.


> Microsoft: invests 10 billion in company. Also Microsoft: here's the tools you need to DIY one of the premium features the company we just invested 10 billion in for free.

This seems like more evidence that under the "commoditize your complement" framework, all intellectual property is the complement, and the only thing actually worth selling for Microsoft is subscriptions and server time.


Fine tuning an existing model won’t get you to GPT4 level quality. You need a larger base model, and lots of fine tuning to have a chance.


Yeah. Most of what's valuable to me about GPT-4 is its reasoning ability, not fact recall or writing quality. Fact recall has been mostly solved by Google search cards for years, and writing quality matters less now that I'm no longer a freelance writer; GPT-3.5 and some of the good open-source models like Koala produce okay writing.

What nothing else can provide is something that will reason intelligently over the data you give it, with quality similar to or better than paying for something like MTurk, for much cheaper and with nearly instant delivery. That reasoning ability comes from the model size and training-data quality, and in real applications using CoT, LangChain, etc., a lot of it comes from the context length. 8k is better than anything else I've tried at real use cases, and I very much want to try 32k, because that opens up a lot of space to do new things (e.g. dump in a textbook on the domain you want the model to reason about). I want even longer context lengths than that too, but we'll have to see how it develops. From what I understand, context length/block size is a pretty direct function of the amount of compute and memory they're willing to devote during training. RWKV's architectural changes may shake that up a bit; we'll see when Stability releases it.


Microsoft also has a partnership with Databricks, which is doing Dolly.

Databricks wants people to use its compute to run other LLMs, and Microsoft wants that compute to be Azure.


> Microsoft: invests 10 billion in company. Also Microsoft: here's the tools you need to DIY one of the premium features the company we just invested 10 billion in for free.

your point?


Also see the example repo README: https://github.com/microsoft/DeepSpeedExamples/tree/master/a...

> With just one click, you can train, generate and serve a 1.3 billion parameter ChatGPT model within 1.36 hours on a single consumer-grade NVIDIA A6000 GPU with 48GB memory. On a single DGX node with 8 NVIDIA A100-40G GPUs, DeepSpeed-Chat enables training for a 13 billion parameter ChatGPT model in 13.6 hours. On multi-GPU multi-node systems (cloud scenarios), i.e., 8 DGX nodes with 8 NVIDIA A100 GPUs/node, DeepSpeed-Chat can train a 66 billion parameter ChatGPT model in under 9 hours. Finally, it enables 15X faster training over the existing RLHF systems.

> The following are some of the open-source examples that are powered by DeepSpeed: Databricks Dolly, LMFlow, CarperAI-TRLX, Huggingface-PEFT

(disclaimer: MSFT/GH employee, not affiliated with this project)


> single consumer-grade NVIDIA A6000 GPU with 48GB memory

I wouldn't call an A6000 "consumer-grade" -- it's about $5000 and 99% of consumers that have graphics cards wouldn't have that.

The top-of-the-line consumer-grade GPU would be an Nvidia RTX 4090/3090 with 24GB of VRAM.


It is solidly a workstation card that is often deployed in data centres. Consumers are far better off with a 4090.


“Consumer grade” here means “you can buy it in a store”. (This is not true of DGX devices.)


Your local MicroCenter doesn’t stock DGX A100/H100 ???


So "off the shelf", then.


Agreed, but two RTX 3090/4090 should be as capable in this regard (having 2x 24GB).


AFAIK the 4090 cannot share RAM like that (no NVLink).


Is it? It might have even more computing power, but are cards able to share VRAM now? My hands-on experience with all this is from a few years ago, and I think it wasn't possible back then.


DeepSpeed is designed to take care of spreading the work for you.

You can link two 3090s with NVLink to increase bandwidth.
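For a sense of what "spreading the work" looks like, here's a minimal ZeRO config sketch (values are hypothetical, not DeepSpeed-Chat's actual settings):

```python
# Hypothetical minimal ZeRO-3 config: DeepSpeed shards parameters,
# gradients, and optimizer state across the available GPUs, so two
# 24GB cards can hold a model that neither could fit alone.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                          # shard params + grads + optimizer state
        "offload_param": {"device": "cpu"},  # optionally spill params to host RAM
    },
}

# Wired into a training script roughly like:
#   import deepspeed
#   engine, optimizer, _, _ = deepspeed.initialize(
#       model=model, model_parameters=model.parameters(), config=ds_config)
# and launched with: deepspeed --num_gpus=2 train.py
```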


I can see a future where each company has an "assistant AI model" trained/updated on its internal data at periodic intervals. The data sources could be group emails, Slack/Teams messages, docs, company PDFs, and so on. Maybe MS will provide it, since it already has access to many of those data sources.


Honestly if you can train a decent model for $1,000, a 15 person company could afford to train it up monthly.


Note that by "train" here they mean "finetune that network a bit", not train from scratch.


For the 1.3-billion-parameter one, they do mean train from scratch on your 2-hour "coffee break".

Which raises the question: do Microsoft AI researchers get two-hour coffee breaks?

—edit—

The way they word it is confusing but, yeah, fine-tune the model on your two-hour coffee break.


No, you can see the time breakdown in the table under the "coffee break" quote: it is the time for the 3-step RLHF process only. Training a 1.3B parameter model from scratch is still a very large undertaking.
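For a rough sense of scale, a common back-of-envelope (training FLOPs ≈ 6 · params · tokens, with the Chinchilla ~20 tokens-per-parameter heuristic; all numbers approximate) puts even a 1.3B pretraining run well beyond a coffee break:

```python
# Back-of-envelope pretraining cost for a 1.3B-parameter model.
# Rule of thumb: training FLOPs ~= 6 * params * tokens.
params = 1.3e9
tokens = 20 * params                  # Chinchilla-style ~20 tokens/param
flops = 6 * params * tokens           # ~2e20 FLOPs
a100_flops_per_sec = 150e12           # assume ~150 TFLOP/s effective on one A100
gpu_hours = flops / a100_flops_per_sec / 3600
# roughly 375 A100-hours, i.e. about 2 days on an 8-GPU DGX node
```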


FYI, they don't compare to trlX because trlX is roughly just as fast. Similarly, they put trl in the worst light possible (trl is actually much faster than they claim).


We're doing some stuff with NVIDIA right now that I can't talk about yet. Super exciting though.


[flagged]


If you are gonna advertise, at least give some hard figures on performance improvements.


To use RLHF you need a dataset that includes instructions with good and bad answers. Do many of those exist? I know there are a few datasets of plain instructions-with-responses, but I'm not aware of any that have both good and bad (or ranked) responses. Is that trivial to produce, or an important missing element here?


All of the chat UIs have little thumbs-up/down icons; that's where the boolean feedback comes from. If people stop using those, sentiment analysis on the human responses will likely go a long way.


OpenAssistant has been collecting instruction/response data. They've already used that data to refine several LLaMA models with good success.

You can also bootstrap RLHF training data from the GPT-4 API. Vicuna is probably the best public model created with GPT-4 data available as of today.
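Mechanically, turning ranked responses into reward-model training data is simple; a sketch (hypothetical helper, following the InstructGPT approach of expanding a ranking of K responses into K·(K−1)/2 pairwise comparisons):

```python
from itertools import combinations

def ranked_to_pairs(prompt, ranked_responses):
    """Expand one prompt with responses ranked best-to-worst into
    (prompt, chosen, rejected) pairs for reward-model training."""
    return [(prompt, better, worse)
            for better, worse in combinations(ranked_responses, 2)]

# 3 ranked responses -> 3 pairwise comparisons
pairs = ranked_to_pairs(
    "How do I reverse a list in Python?",
    ["Use lst[::-1] or reversed(lst).",          # best
     "lst.reverse() works, but it's in-place.",
     "Loop over the list and swap elements."])   # worst
```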


If I understood correctly, the OpenAssistant team wants to open-source its community-built RLHF dataset.

On the other hand, if you're being cheeky, I bet there's a way to data-mine websites like ShareGPT and profit off shared ChatGPT <> user interactions.


You’re not supposed to do this, but GPT-4 can generate RLHF data.


It's a little funny that Microsoft DeepSpeed doesn't fully work on Windows.


Speed requires sacrifices


It is great. I remember the Microsoft that told developers at talks not to use Google search because it wasn't M$.


I've gotten so used to ChatGPT that I just copied the text of this thread and told it to summarize the whole thing in 5 paragraphs.

I know there was a summary, but the point is that ChatGPT really accelerates a LOT of bulk work we used to have to do manually.

It's an amazing time to be alive!


I do the same. Paste some text, ask for a summary, then expand the summary up to my knowledge level, then ask for examples and analogies to get me comfortably beyond my level of understanding. It's pretty great.


Does RLHF help with training an LLM to produce better (more accurate) results for a particular problem domain (e.g. customer support for a particular company), or is it only helpful for training the LLM to be a chat agent in general, or a chat agent with guardrails?


RLHF helps with getting the model in the "mood" to output responses in a certain style that users find helpful, e.g. how to write a poem or an email.

It doesn't increase its knowledge of the world or its capabilities.


What's the difference between the critic model and the reward model? In the diagram they show both.

EDIT: Is the idea that the critic model learns via the PPO process and gives a value estimate to prefixes of the responses?
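If that reading is right, the split is: the reward model scores the finished response once, while the critic predicts a value for each prefix so PPO can compute per-token advantages. A simplified sketch of that advantage computation (generalized advantage estimation; not DeepSpeed's actual code):

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Per-token advantages from per-token rewards and critic values.

    rewards[t] is often zero except at the final token, where the reward
    model's scalar score for the whole response is applied; values[t] is
    the critic's value estimate for the prefix ending at token t.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = 0.0                      # no value beyond the final token
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages
```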


Microsoft being more open than OpenAI, haha.


This is a really cool step, but as someone without the suggested GPU, it isn't easy or one-click for me yet.

I am hoping that someone makes a very simple Jupyter notebook where I can enter my RLHF file and select a few other settings and just run (on AWS or Azure; willing to pay per fine-tuned model say $100-$500 for cloud credits + notebook access).


TFA is fairly literally an advertisement for easy LLM training on Azure GPUs.


What is TFA? Thanks.


The featured article.

Aka, the article under discussion.

(Though some people use other words for the F part.)


Woah, I’ve never heard “featured” before! I suspect many people mean that; I’ve just always taken the acronym as being crass and condescending.

Haven’t felt this way since I realized about 50% of my social network meant “In My Honest Opinion” and the other 50% meant “In My Humble Opinion”


Thanks!


"The Freaking Article", possibly


Thanks!


In case anyone else is curious, this YouTube video helped me understand how Azure ML works, and I think with that understanding, combined with the GitHub README, it will be doable. https://www.youtube.com/watch?v=yBVXR8G8Bg8



