
With DeepSeek-R1-Zero, their use of RL didn't really have reward functions that indicated progress towards the goal, afaik.

It was "correct structure, wrong answer", "correct answer", "wrong answer". This was for Math & Coding, where they could verify answers deterministically.
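Roughly, a three-way check like that could be sketched as below. This is only an illustration, not their actual code: the tag layout, the function name `rule_based_reward`, and the reward values 1.0 / 0.1 / 0.0 are all assumptions I picked for the example, and output that doesn't parse is simply treated as a wrong answer.

```python
import re

# Hypothetical output layout: reasoning in <think> tags, then the final
# answer in <answer> tags. The tag names are an assumption for this sketch.
PATTERN = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL)


def rule_based_reward(completion: str, expected: str) -> float:
    """Deterministic reward: no learned model, just string checks.

    The numeric values are illustrative assumptions.
    """
    match = PATTERN.fullmatch(completion.strip())
    if match is None:
        # Structure is wrong, so the answer cannot be extracted and verified.
        return 0.0
    answer = match.group(1).strip()
    if answer == expected.strip():
        # Correct answer in the correct structure.
        return 1.0
    # Correct structure, wrong answer.
    return 0.1
```

Note that nothing here scores partial progress: the reasoning inside the think tags is never inspected, only the final answer and the overall shape of the output.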



It is a reward function; it's just a deterministic one. Reward models are often hacked, which prevents real reasoning from being discovered.



