With DeepSeek-R1-Zero, the RL reward functions didn't really signal partial progress towards the goal, afaik.
The outcomes were just "correct structure, wrong answer", "correct answer", and "wrong answer". This was for math and coding, where they could verify answers deterministically.
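To make the three outcomes concrete, here's a rough sketch of what such an outcome-only reward could look like. Everything specific in it is my assumption, not DeepSeek's actual code: the reward values, the `<think>`/`<answer>` template, the exact-string comparison, and the name `outcome_reward` are all just for illustration.

```python
import re

# Hypothetical reward values; the real numbers aren't stated above.
REWARD_CORRECT = 1.0      # "correct answer"
REWARD_FORMAT_ONLY = 0.1  # "correct structure, wrong answer"
REWARD_WRONG = 0.0        # "wrong answer"

# Assumed output template: reasoning in <think> tags, final answer in <answer> tags.
TEMPLATE = re.compile(
    r"^<think>.*?</think>\s*<answer>(.*?)</answer>$", re.DOTALL
)


def outcome_reward(completion: str, expected: str) -> float:
    """Score a completion by its final outcome only, with no partial credit."""
    match = TEMPLATE.match(completion.strip())
    if match is None:
        # Without the expected structure there is no answer to extract,
        # so this sketch treats the completion as simply wrong.
        return REWARD_WRONG
    answer = match.group(1).strip()
    if answer == expected.strip():
        # Deterministic check: exact match against the known answer.
        return REWARD_CORRECT
    return REWARD_FORMAT_ONLY


print(outcome_reward("<think>2 + 2</think><answer>4</answer>", "4"))
print(outcome_reward("<think>2 + 2</think><answer>5</answer>", "4"))
print(outcome_reward("the answer is 4", "4"))
```

Note that nothing here looks at the reasoning itself: a chain of thought that is nearly right scores the same as one that is nonsense, as long as the final answer is wrong. For coding, the exact-match check would presumably be swapped for running test cases, but the shape stays the same.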