I was including RLHF in "training". And even the system prompt, really.
If it's true that models can be prevented from spiraling into dead ends with "proper prompting" as the comment above claimed, then it's also true that this can be addressed earlier in the process.
As it stands, this behavior isn't likely to be useful for any normal user, and it's certainly a blocker to "agentic" use.
The RLHF is happening too late, I think. The reinforcement learning needs to happen during the initial next-token prediction. On that note, we need something richer than language alone to represent a complex world state.