I don't have any intuition here and am in no way qualified, but my read of the paper was that GRPO is mainly an optimization to reduce cost and GPU memory during training (by skipping the need to keep another copy of the LLM in memory as the value network), but otherwise any RL algorithm should have worked? I mean, it seems R1 uses outcome rewards only and GRPO doesn't do anything special to alleviate reward sparsity, so it feels like the choice of optimizer shouldn't affect viability too much.
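To make that concrete, here is a rough sketch (my own illustration, not code from the paper) of the group-relative advantage GRPO substitutes for a learned value baseline: each completion's outcome reward is just normalized against the other completions sampled for the same prompt, so no second LLM is needed as a critic.

    from statistics import mean, pstdev

    def grpo_advantages(rewards, eps=1e-8):
        # rewards: outcome rewards r_1..r_G for a group of G completions
        # sampled from the same prompt (e.g. 1.0 if the answer is correct).
        mu, sigma = mean(rewards), pstdev(rewards)
        # The group-normalized reward serves as the advantage; eps guards
        # against a group where every completion got the same reward.
        return [(r - mu) / (sigma + eps) for r in rewards]

    # Correct completions get a positive advantage, incorrect ones negative:
    print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # ~[1.0, -1.0, -1.0, 1.0]

PPO would instead estimate that baseline with a separate value model, which is exactly the extra copy of the LLM that GRPO avoids keeping in memory.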
Also, on the note of RL optimizers: if anyone here is familiar with this space, could they comment on how the recently introduced PRIME [1] compares to PPO directly? Their description is confusing, since the "implicit PRM" they introduce, which is trained alongside the policy network, seems no different from the value network in PPO.
[1] https://github.com/PRIME-RL/PRIME