Hacker News

Coming up with a reward model seems to be really easy though.

Every decidable problem can be used as a reward model. The only downside is that the LLM community has developed a severe disdain for making LLMs perform anything that can be verified by a classical algorithm. Only the most random data from the internet will do!
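To illustrate the point, here is a minimal sketch of a verifiable reward signal. The task (sorting a list) and the function names are made up for illustration; the idea is just that a classical algorithm decides correctness, so no learned reward model is needed:

```python
# Hypothetical example task: sort a list of integers.
# A classical algorithm verifies the model's output, so the reward
# is exactly decidable -- 1.0 for a correct answer, 0.0 otherwise.

def verifiable_reward(problem, model_output):
    """Return 1.0 if a classical checker accepts the output, else 0.0."""
    expected = sorted(problem)           # ground truth via a classical algorithm
    try:
        answer = [int(x) for x in model_output.split(",")]
    except ValueError:
        return 0.0                       # unparseable output earns no reward
    return 1.0 if answer == expected else 0.0

# Score two (hypothetical) model completions against the checker.
print(verifiable_reward([3, 1, 2], "1,2,3"))  # 1.0
print(verifiable_reward([3, 1, 2], "3,1,2"))  # 0.0
```

Any decidable problem (SAT instances with known answers, unit-tested code, exact arithmetic) slots into the same pattern.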



That would help with decidable problems, but it still wouldn't generalise to problems with non-trivial rewards, or to ones with no reward signal at all.


Reasoning seems to generalize, insofar as o1 and DeepSeek-R1 are better at answering questions than their base models.



