Coming up with a reward model seems to be really easy though.
Every decidable problem can be used as a reward model. The only downside is that the LLM community has developed a severe disdain for making LLMs do anything that can be verified by a classical algorithm. Only the most random data from the internet will do!
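To make that concrete, here is a rough sketch of what I mean, using sorting as the decidable problem. The function name `sorting_reward` and the space-separated output format are my own assumptions for illustration, not anything from a real training setup. Strictly speaking this is a hand-written verifier used as a reward signal rather than a learned model.

```python
from collections import Counter


def sorting_reward(prompt_items: list[int], model_output: str) -> float:
    """Hypothetical reward: 1.0 if the output is the prompt's items in sorted order, else 0.0."""
    try:
        # Assumed format: the model answers with space-separated integers.
        answer = [int(tok) for tok in model_output.split()]
    except ValueError:
        return 0.0  # unparseable output earns nothing

    # The answer must contain exactly the same items as the prompt...
    is_permutation = Counter(answer) == Counter(prompt_items)
    # ...and they must be in non-decreasing order.
    is_sorted = all(a <= b for a, b in zip(answer, answer[1:]))
    return 1.0 if is_permutation and is_sorted else 0.0
```

The check is cheap and exact and needs no human labels, which is the whole appeal. Any other problem with a classical checker could be swapped in the same way.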