Yes, there's a podcast with the post-training lead for Llama 3 where he mentions this. Lemme try and find it.
edit: found it. The money quote is here, but I really recommend the entire podcast since it's full of great tidbits and insights.
> Thomas [00:33:44]: You mean between supervised fine-tuning, like supervised fine-tuning annotation, and preference annotation? Yeah. So 100% to RLHF. In fact, that's quite interesting. For Llama 2, you start with a pre-trained model, and you have to have an instruction model, a chat model; otherwise the model just continues finishing sentences. So you need that to start RLHF. So we had to annotate like 10,000 examples.
>
> What did we do for Llama 3? You start with a new pre-trained model, and then, before starting RLHF, you want a chat model which is not too bad. Option one was: let's do human annotation again, like the SFT stage. But in fact, by the principle I said before, the annotation would actually be worse than Llama 2. So what we did is generate all the data on the prompts with Llama 2; we basically applied the last round of Llama 2 we had to kick off and start Llama 3 post-training. So Llama 3 post-training doesn't have any human-written answers there, basically, almost. It's just leveraging pure synthetic data from Llama 2.
So yes, the "almost entirely synthetic data" thing is real: Llama 3's initial chat/SFT data was generated by the final Llama 2 model rather than written by human annotators.
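
For the curious, here's a minimal sketch of what that bootstrapping step could look like, assuming a Hugging Face `transformers` setup. Everything here (model name, prompts, sampling settings, file format) is my own stand-in, not Meta's actual pipeline; it just illustrates the core move of having the previous generation's final RLHF'd model write the new generation's SFT answers.

```python
# Rough sketch of the bootstrapping step Thomas describes: take your prompt
# set, have the *previous* generation's last RLHF round write the answers,
# and use those pairs as SFT data for the new pre-trained model. All names
# and parameters below are assumed stand-ins, not Meta's actual setup.
import json
from transformers import pipeline

# The old model's final RLHF checkpoint acts as the "teacher".
# (Assumed checkpoint; a real setup would also apply the model's chat template.)
teacher = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-70b-chat-hf",
    device_map="auto",
)

def synthesize_sft_data(prompts, out_path="sft_seed.jsonl"):
    """Write (prompt, answer) pairs where every answer is model-generated."""
    with open(out_path, "w") as f:
        for prompt in prompts:
            answer = teacher(
                prompt,
                max_new_tokens=512,
                do_sample=True,
                temperature=0.7,
                return_full_text=False,  # keep only the generated answer
            )[0]["generated_text"]
            f.write(json.dumps({"prompt": prompt, "response": answer}) + "\n")

if __name__ == "__main__":
    # Toy prompt set; the real thing would be the full post-training prompt mix.
    synthesize_sft_data([
        "Explain RLHF in two sentences.",
        "Write a haiku about gradient descent.",
    ])
```

The point being: the SFT stage that kicks off RLHF can be seeded without fresh human-written answers, since (per the "principle" Thomas alludes to) the old model's last RLHF round already writes better answers than new human annotation would.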