remember they have access to the RLHF reward model, against which they can evaluate all N outputs and have the most "rewarded" answer picked and sent
remember they have access to the RLHF reward model, against which they can evaluate all N outputs and have the most "rewarded" answer picked and sent