
Yes, great point. We are currently working on multistep RL. The big problem with the trivial approach (giving a single reward to the entire (ReAct) trajectory) is that the model receives a weak learning signal per decision (the credit assignment problem in the literature): the individual decisions are not properly accounted for, which makes training unstable. I guess this has been an unsolved problem for a long time; however, it was not really looked at because generalist "planning" agents were not a big thing in RL until o1/DeepSeek.
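
To make the weak-signal point concrete, here is a minimal toy sketch (not their actual training code; the log-probs and the outcome reward are made-up values) of the trivial approach, where one scalar trajectory reward is broadcast to every decision in a REINFORCE-style update:

    import torch

    # Toy ReAct-style trajectory: per-step action log-probs under the policy
    # (hypothetical values standing in for tool calls / reasoning steps).
    logps = torch.tensor([-1.2, -0.4, -2.1, -0.7], requires_grad=True)

    # Trivial approach: one scalar reward for the whole trajectory,
    # e.g. "task solved" from an outcome grader.
    trajectory_reward = 1.0

    # The same reward is broadcast to every decision, so a bad intermediate
    # action and a good one receive identical credit -- this is the
    # credit-assignment problem described above.
    loss = -(trajectory_reward * logps).sum()
    loss.backward()
    print(logps.grad)  # the same gradient scale on every step, regardless of step quality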

IMO, the most promising approach to this is something along the lines of MA-RLHF (https://arxiv.org/abs/2410.02743) but adapted to the real world, i.e., splitting up the reward model to grade individual actions inside the trajectory to reduce the "attention distance" between the reward and the decision.
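
Roughly, the idea would look like the sketch below (again just an illustration, not anything from an actual pipeline; the per-step grades, the discount factor, and the return-to-go formulation are my assumptions about how one could wire it up): each action gets its own grade from the reward model, so the learning signal sits right next to the decision it judges.

    import torch

    # Hypothetical per-step scores from a reward model that grades each action
    # inside the trajectory, instead of a single end-of-trajectory reward.
    step_rewards = torch.tensor([0.8, -0.5, 0.9, 0.6])   # one grade per decision
    logps = torch.tensor([-1.2, -0.4, -2.1, -0.7], requires_grad=True)

    # Discounted return-to-go per step, so each decision is credited with what
    # follows it rather than with the whole trajectory.
    gamma = 0.99
    returns = torch.zeros_like(step_rewards)
    running = 0.0
    for t in reversed(range(len(step_rewards))):
        running = step_rewards[t] + gamma * running
        returns[t] = running

    # Policy-gradient loss with per-step credit: the "attention distance"
    # between a decision and the reward that grades it is now one step,
    # not the whole trajectory.
    loss = -(returns.detach() * logps).sum()
    loss.backward()
    print(logps.grad)  # per-step gradients now differ, reflecting each action's grade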
