It feels like OpenAI is at a ceiling with their models; Codex-1 seems to be another RLHF derivative of the same base model. You can see this in their own self-reported comparison against o3-high, where at 8 tries the two converge to the same accuracy.
It also seems telling that they have not mentioned o4-high benchmarks at all. o4-mini exists, so logically there is a full o4 model, right?
Seems likely they are holding back full o4 results until the GPT-5 release later this year, presumably because GPT-5 is bundled with roughly o4-level reasoning capability and they want GPT-5 to feel like a significant release.
Marketing names aren’t really connected to product generations. We might target v3 of a product for a date and then decide it’s really a 2.4; that doesn’t mean we won’t market something as v3 later.