Also, there are so many innovations in their papers (DeepSeekMath, DeepSeek-V2/V3, R1) that I honestly wouldn't even care. They figured out a way to train on only 2048 H800s when big companies are buying GPUs by the hundreds of thousands. They created a new RL algorithm (GRPO). They improved MoE. They improved the KV cache with Multi-head Latent Attention. They built a super efficient training framework.