
I feel it’s not related to the data or the architecture but to the process of reasoning in general. For these models, every predicted token conditions, or drives, the output in a certain direction. The semantic meaning of each token has a magnitude in solution space: “the answer is 5” is a very large step, while an “and” token is a very small one. If you are looking for a very specific answer, these smaller nudges at each token generation provide corrections to the direction. Imagine trying to click a narrow button with high-sensitivity mouse settings: obviously you need many small moves, whereas with a big button maybe you can one-shot it.

The harder or more specific a task is, i.e. the narrower the solution space, the less it can possibly be one-shotted, so you need to learn to take smaller steps and possibly revert if you feel the overall direction is bad. This is what RL is teaching the model here: response length increases (the model learns to take smaller steps, revert, etc.) along with performance. You reward the model only if the final solution is correct, and the model discovers that being cautious and evaluating many intermediate steps is the better approach (a rough sketch of that outcome-only reward setup is below).

Personally I feel this is how we reason too: reasoning in general is taking smaller steps and being able to evaluate whether you are in a wrong position so you can backtrack. Einstein didn’t one-shot relativity, after all, and had to backtrack from who knows how many dead ends.
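To make the “reward only the final answer” point concrete, here is a minimal toy sketch, not the actual training recipe of any paper: a tabular softmax policy trained with plain REINFORCE, where the reward is 1 only if the whole sampled answer matches a hypothetical TARGET sequence and 0 otherwise, with no per-token shaping. The vocabulary size, target, and learning rate are all made-up toy values.

  import numpy as np

  rng = np.random.default_rng(0)

  VOCAB = 5            # toy vocabulary size (assumption, not a real tokenizer)
  LENGTH = 3           # answer length
  TARGET = [3, 1, 4]   # the "correct answer" the policy must discover
  LR = 0.5

  # One independent softmax distribution per position, standing in for an LLM policy.
  logits = np.zeros((LENGTH, VOCAB))

  def sample_sequence(logits):
      """Sample one token per position and keep the probs needed for REINFORCE."""
      seq, trace = [], []
      for pos in range(LENGTH):
          probs = np.exp(logits[pos] - logits[pos].max())
          probs /= probs.sum()
          tok = rng.choice(VOCAB, p=probs)
          seq.append(int(tok))
          trace.append((pos, int(tok), probs))
      return seq, trace

  baseline = 0.0
  for step in range(3000):
      seq, trace = sample_sequence(logits)
      # Outcome-only reward: 1 if the whole answer is correct, else 0.
      reward = 1.0 if seq == TARGET else 0.0
      advantage = reward - baseline
      baseline = 0.99 * baseline + 0.01 * reward
      # REINFORCE update: every token in the trajectory shares the terminal reward.
      for pos, tok, probs in trace:
          grad = -probs        # d(log p[tok]) / d(logits) = one_hot(tok) - probs
          grad[tok] += 1.0
          logits[pos] += LR * advantage * grad

  print("learned answer:", [int(np.argmax(logits[p])) for p in range(LENGTH)])

The point of the sketch is that nothing in the reward tells the policy *how* to get to the answer; intermediate behavior (in an LLM, longer and more cautious chains of thought) only gets reinforced indirectly because it raises the chance of the final answer being right.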


