So am I to understand that they used their internal tooling scaffold on the o3(tools) results only? Because if so, I really don't like that.

While it's nonetheless impressive that o3-mini combined with their tool scaffolding scored 61% on SWE-bench, the Agentless comparison with other models is less impressive: 40% vs. 35% against o1-mini, if you look at the graph on page 28 of their system card PDF (https://cdn.openai.com/o3-mini-system-card.pdf).

It just feels like data manipulation to suggest that o3-mini is much more performant than past models. A fairer picture would still show a performance improvement, but it would look less exciting and more incremental.

Of course the real improvement is cost, but still, it kind of rubs me the wrong way.




YC usually says “a startup is the point in your life where tricks stop working”.

Sam Altman is somehow finding this out now, the hard way.

Most paying customers will find out within minutes whether the models can serve their use case. A benchmark isn't going to change that, except as media manipulation (and even that doesn't work all that well, since journalists don't really know what they are saying and readers can tell).


My guess is this cheap mini-model is coming out now because DeepSeek very recently shook the stock market with its low price and relatively good performance.


o3-mini has been coming for a while; IIRC it was "a couple of weeks" away a few weeks ago, before R1 hit the news.



