Thanks for this very lucid post! For many use cases, such as coding or formatting, it's very clear to users how to define the reward function. For more intricate ones, you're right that it can be tricky. I like your ideas of providing tools to help here and offering recurring reward functions as templates that need only slight adaptations. It will still be the user defining the reward, but there's a path to simplification. The operational friction of getting GPUs, optimizing compute, and preparing the training is hard for RL, which is why we took these things out of the way.
Thanks for the very thoughtful suggestions and for reaching out, great input!