With DeepSeek-R1-Zero, their usage of RL didn't have reward functions really tha... | Hacker News


drakenot 12 months ago | on: Recent results show that LLMs struggle with compos...

With DeepSeek-R1-Zero, the RL setup didn't really have reward functions that indicated progress towards the goal, afaik. The outcomes were just "correct structure, wrong answer", "correct answer", and "wrong answer". This was for math and coding, where they could verify answers deterministically.
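For illustration, here is a minimal sketch of what a three-outcome, rule-based reward like that could look like. This is not DeepSeek's actual code: the `<think>`/`<answer>` tag layout, the function name `rule_based_reward`, the exact-string answer check, and the reward values 1.0 / 0.1 / 0.0 are all assumptions made up for the example.

```python
import re

# Hypothetical output format: reasoning inside <think>...</think>,
# followed by the final answer inside <answer>...</answer>.
STRUCTURE = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL)


def rule_based_reward(completion: str, expected: str) -> float:
    """Score a completion with a deterministic check, no learned reward model.

    The reward values below are illustrative assumptions, not published numbers.
    """
    match = STRUCTURE.fullmatch(completion.strip())
    if match is None:
        # Wrong structure: the answer cannot be extracted, so it counts as wrong.
        return 0.0
    if match.group(1).strip() == expected.strip():
        # Correct structure and correct answer.
        return 1.0
    # Correct structure, wrong answer.
    return 0.1


print(rule_based_reward("<think>2 + 2 = 4</think><answer>4</answer>", "4"))
print(rule_based_reward("<think>2 + 2 = 5</think><answer>5</answer>", "4"))
print(rule_based_reward("The answer is 4", "4"))
```

Note that nothing in this score says how close a wrong answer was, which is the "no progress signal" point above.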

mountainriver 12 months ago

It is a reward function; it's just a deterministic one. Reward models are often hacked, which prevents real reasoning from being discovered.
