Only 20 training samples improved LLM performance? That sounds unrealistic! My experience with RLHF for LLM performance differs. Can you be more specific about the case where you achieved this and share technical details about how you did it?
We are not doing RLHF; we fine-tune directly on a reward function. Our task was improving a coding agent that writes JSONata (https://jsonata.org).
GPT-4o is quite bad at this, as there are not many JSONata snippets on the internet. We collected 20 coding problems; the reward function then simply assigned a scalar value based on whether the model's code output was syntactically correct or not. (Interestingly, we found that by optimizing for syntax, the model also got better at getting the semantics right.)
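For concreteness, the reward boils down to something like the sketch below: a binary check of whether the generated expression parses. This is an illustrative Python version that assumes the jsonata-python package as the parser (not necessarily the binding we used); any JSONata parser that raises on invalid syntax would work the same way.

```python
# Illustrative sketch of the syntax-only reward, assuming the jsonata-python
# package ("pip install jsonata-python"); any JSONata parser that raises on a
# syntax error can be substituted.
from jsonata import Jsonata


def syntax_reward(model_output: str) -> float:
    """Return 1.0 if the generated JSONata expression parses, else 0.0."""
    try:
        Jsonata(model_output)  # compiling the expression raises if the syntax is invalid
        return 1.0
    except Exception:
        return 0.0
```

A per-completion scalar like this is the only reward interface most RL fine-tuning loops need, which is part of why such a small problem set can still be useful.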
I think the discrepancy between our result with direct RL and your experience with RLHF comes from the fact that RLHF is built around non-verifiable/subjective domains, where the reward signal obtained through the human-feedback proxy is intrinsically weaker; i.e., for the same training scenario/prompt you need more samples to get an equally useful gradient.