Yann LeCun also keeps distorting what open source means. Neither Llama nor DeepSeek is open source, and neither ever was. Releasing weights is not open source; that's just releasing the final artifact. DeepSeek does use a more permissive license than Llama, but neither model is open source, because the community does not have the pieces needed to reproduce the work from scratch.
Open source means we need to be able to reproduce what they've built, which requires transparency on the training data, the training source code, the evaluation suites, and so on. That is what AI2 does with their OLMo models: https://allenai.org/blog/olmo2
DeepSeek R1 is currently the closest thing we have to fully open source. It is open enough that Hugging Face is recreating R1 completely in the open: https://github.com/huggingface/open-r1
What they're recreating is evidence that some of the techniques work. But they start with R1 itself as an input to those steps rather than starting from scratch, and as far as I can tell their work does not include training a base model.
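To make "starting with R1 as the input" concrete, here is a minimal sketch of a distillation step: sample reasoning traces from the released R1 model and save them as a fine-tuning dataset for a smaller student model. The endpoint, model name, and prompts are my assumptions for illustration, not open-r1's actual pipeline.

    # Sketch of a distillation step: use the released R1 model as a teacher
    # to generate reasoning traces, collected as training data for a student.
    # Endpoint, model name, and prompts are illustrative assumptions.
    import json
    from openai import OpenAI

    # DeepSeek exposes an OpenAI-compatible API; "deepseek-reasoner" is R1.
    client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")

    prompts = [
        "Prove that the square root of 2 is irrational.",
        "How many positive divisors does 360 have?",
    ]

    with open("traces.jsonl", "w") as f:
        for prompt in prompts:
            resp = client.chat.completions.create(
                model="deepseek-reasoner",  # the teacher: R1 itself
                messages=[{"role": "user", "content": prompt}],
            )
            f.write(json.dumps({
                "prompt": prompt,
                "completion": resp.choices[0].message.content,
            }) + "\n")

The point is that the pipeline consumes R1's outputs. Without the released model there is nothing to distill from, so this is not a from-scratch reproduction.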
The fundamental problem is that AI depends on massive amounts of IP theft. I'm not going to argue about whether that's right or wrong, but without it we wouldn't even have open-weights models.