It could be, but there’s so much hype surrounding the GPT-5 release that I’m not sure whether their internal models will live up to it.
For GPT-5 to dwarf these just-released models in importance, it would have to be a huge step forward, and I still have doubts about OpenAI’s capacity and infrastructure to handle demand at the moment.
As a sidebar, I’m still not sure whether GPT-5 will be transformative because of its capabilities so much as its accessibility. All it really needs to do to be highly impactful is lower the barrier to entry for the more powerful models. I could see that alone making it worth the hype. Surely it will be better, but if more people are able to leverage it, that’s just as revolutionary, if not more so.
That doesn’t sound good. It sounds like OpenAI will route my request to whichever model is cheapest for them and most expensive for me, with the minimum viable results.
The catch is that it only has ~5 billion active params, so it should perform worse than the top DeepSeek and Qwen models, which have around 20-30 billion active, unless OpenAI pulled off a miracle.
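For a rough sense of why "active" params can be so far below the headline size, here's a back-of-envelope sketch of a mixture-of-experts count. Every number and the active_params helper below are made up for illustration, not official configs for gpt-oss, DeepSeek, or Qwen:

    # Rough MoE active-parameter estimate; all numbers are illustrative.
    def active_params(expert_params_total, n_experts, experts_per_token, shared_params):
        # Only the routed experts run for a given token, plus the always-on
        # (attention/embedding/shared) weights.
        return shared_params + expert_params_total * (experts_per_token / n_experts)

    # e.g. ~110B of expert weights, 128 experts, 4 routed per token,
    # ~1.5B shared weights -> roughly 5B params used per token.
    print(f"~{active_params(110e9, 128, 4, 1.5e9) / 1e9:.1f}B active per token")

So per-token compute is closer to a small model's, which is why comparing it against models with 20-30 billion active params is the relevant comparison rather than total size.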
The catch is that performance is not actually comparable to o4-mini, never mind o3.
When it comes to LLMs, benchmarks are bullshit. If they sound too good to be true, it's because they are. The only thing benchmarks are useful for is preliminary screening: if the model does especially badly in them, it's probably not good in general. But if it does well in them, that doesn't really tell you anything.
It's definitely interesting how the comments from right after the models were released were ecstatic about "SOTA performance" and how it is "equivalent to o3", and then comments like yours hours later, after having actually tested it, keep pointing out how it's garbage compared to even the current batch of open models, let alone proprietary foundation models.
Yet another data point for benchmarks being utterly useless and completely gamed at this stage in the game by all the major AI developers.
These companies are clearly all very aware that the initial wave of hype at release is "sticky" and drives buzz/tech news coverage, while real-world tests take much longer before that impression slowly starts to be undermined by practical usage and comparison to other models. Benchmarks with wildly overconfident naming like "Humanity's Last Exam" aren't exactly helping with objectivity either.
GPT-5 will probably be way, way better. If alpha/beta Horizon are early previews of GPT-5 family models, then coding should be > Opus 4 for modern frontend stuff.
What’s the catch?