
Open weight models from OpenAI with performance comparable to that of o3 and o4-mini in benchmarks… well, I certainly wasn’t expecting that.

What’s the catch?



Because GPT-5 comes out later this week?


It could be, but there’s so much hype surrounding the GPT-5 release that I’m not sure whether their internal models will live up to it.

For GPT-5 to dwarf these just-released models in importance, it would have to be a huge step forward, and I still have doubts about OpenAI's capacity and infrastructure to handle demand at the moment.


As a sidebar, I’m still not sure whether GPT-5 will be transformative because of its capabilities so much as its accessibility. All it really needs to do to be highly impactful is lower the barrier to entry for the more powerful models. I could see that contributing to it being worth the hype. Surely it will be better, but if more people are capable of leveraging it, that’s just as revolutionary, if not more so.


It seems like a big part of GPT-5 will be that it will be able to intelligently route your request to the appropriate model variant.
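Roughly something like this, I'd imagine. A minimal sketch in Python, where the model names, thresholds, and routing heuristic are all made up, since OpenAI hasn't described how the router actually works:

    # Minimal sketch of request routing: a cheap heuristic picks a model tier
    # before the expensive call is made. All model names and thresholds here
    # are hypothetical -- OpenAI hasn't published how its router works.

    REASONING_HINTS = ("prove", "step by step", "debug", "refactor", "optimize")

    def pick_model(prompt: str) -> str:
        """Route a request to a model tier based on a crude complexity guess."""
        text = prompt.lower()
        if any(hint in text for hint in REASONING_HINTS) or len(text) > 2000:
            return "gpt-5-thinking"   # hypothetical: slow, expensive, strongest
        if len(text) > 300:
            return "gpt-5-standard"   # hypothetical: middle tier
        return "gpt-5-mini"           # hypothetical: cheap default

    if __name__ == "__main__":
        print(pick_model("What's the capital of France?"))             # -> gpt-5-mini
        print(pick_model("Debug this race condition in my C++ code"))  # -> gpt-5-thinking

The real thing is presumably a learned classifier rather than keyword matching, but the economics are the same: the cheap model handles everything it plausibly can.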


That doesn’t sound good. It sounds like OpenAI will route my request to whichever model is cheapest for them and most expensive for me, with the minimum viable results.


Sounds just like what a human would do. Or any business for that matter.


That may be true, but I thought the promise was that we were moving in the direction of AGI/ASI/whatever and that models would become more capable over time.


Surely OpenAI would not be releasing this now unless GPT-5 were much better.


The catch is that it only has ~5 billion active parameters, so it should perform worse than the top DeepSeek and Qwen models, which have around 20-30 billion active, unless OpenAI pulled off a miracle.


The catch is that performance is not actually comparable to o4-mini, never mind o3.

When it comes to LLMs, benchmarks are bullshit. If they sound too good to be true, it's because they are. The only thing benchmarks are useful for is preliminary screening: if a model does especially badly on them, it's probably not good in general. But if it does well on them, that doesn't really tell you anything.


It's definitely interesting how the comments from right after the models were released were ecstatic about "SOTA performance" and how it is "equivalent to o3", and then comments like yours, hours later, after people have actually tested it, keep pointing out that it's garbage compared to even the current batch of open models, let alone proprietary foundation models.

Yet another data point for benchmarks being utterly useless and completely gamed at this stage in the game by all the major AI developers.

These companies are clearly all very aware that the initial wave of hype at release is "sticky" and drives buzz and tech news coverage, while real-world testing takes much longer before that first impression slowly starts to be undermined by practical usage and comparison to other models. Benchmarks with wildly overconfident names like "Humanity's Last Exam" aren't exactly helping with objectivity either.


> What’s the catch?

Probably that GPT-5 will be way, way better. If the Horizon alpha/beta models are early previews of the GPT-5 family, then coding should be better than Opus 4 for modern frontend stuff.



