I think this is with and without "tools." They explain it in the system card:
> We evaluate SWE-bench in two settings:
> *• Agentless*, which is used for all models except o3-mini (tools). This setting uses the Agentless 1.0 scaffold, and models are given 5 tries to generate a candidate patch. We compute pass@1 by averaging the per-instance pass rates of all samples that generated a valid (i.e., non-empty) patch. If the model fails to generate a valid patch on every attempt, that instance is considered incorrect.
> *• o3-mini (tools)*, which uses an internal tool scaffold designed for efficient iterative file editing and debugging. In this setting, we average over 4 tries per instance to compute pass@1 (unlike Agentless, the error rate does not significantly impact results). o3-mini (tools) was evaluated using a non-final checkpoint that differs slightly from the o3-mini launch candidate.
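To make sure I'm reading the Agentless scoring right, here's a minimal sketch of how I understand that pass@1 computation (my interpretation, hypothetical helper names, not OpenAI's actual code): average each instance's pass rate over its valid (non-empty) patches, count instances with no valid patch as 0, then average over instances.

```python
def instance_pass_rate(samples):
    """samples: list of (patch_text, passed) tuples for one instance's tries."""
    valid = [passed for patch, passed in samples if patch.strip()]
    if not valid:  # no non-empty patch on any try -> instance counts as incorrect
        return 0.0
    return sum(valid) / len(valid)

def swebench_pass_at_1(instances):
    """instances: list of per-instance sample lists; mean of per-instance rates."""
    return sum(instance_pass_rate(s) for s in instances) / len(instances)

# Toy example: two instances, 5 tries each
inst_a = [("fix", True), ("fix", True), ("", False), ("fix", False), ("fix", True)]
inst_b = [("", False)] * 5  # never produced a valid patch
print(swebench_pass_at_1([inst_a, inst_b]))  # 0.375: (3/4 + 0) / 2
```

Note the subtlety: invalid (empty-patch) samples are simply dropped from an instance's denominator unless *all* tries are invalid, which is presumably why they call out that the error rate matters for Agentless but not for the tools scaffold.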
So am I to understand that they used their internal tool scaffold only for the o3-mini (tools) results? Because if so, I really don't like that.
While scoring 61% on SWE-bench with o3-mini plus their tool scaffolding is nonetheless impressive, the Agentless comparison with other models is less so: roughly 40% vs. 35% against o1-mini, if you look at the graph on page 28 of their system card PDF (https://cdn.openai.com/o3-mini-system-card.pdf).
It just feels like data manipulation to suggest that o3-mini is much more performant than past models. A fairer picture would still show a performance improvement, but it would look less exciting and more incremental.
Of course the real improvement is cost, but still, it kind of rubs me the wrong way.
YC usually says “a startup is the point in your life where tricks stop working”.
Sam Altman is somehow finding this out now, the hard way.
Most paying customers will find out within minutes whether the models can serve their use case; a benchmark isn't going to change that, except as media manipulation (and even that doesn't work all that well, since journalists don't really know what they're saying and readers can tell).
My guess is this cheap mini-model comes out now because DeepSeek very recently shook the stock market with its low price and relatively good performance.