The fact that Gemini 3 is so far ahead of every other frontier model in math might be telling us something more general about the model itself.
It scored 23.4% on MathArena Apex, compared with 0.5% for Gemini 2.5 Pro, 1.6% for Claude Sonnet 4.5 and 1.0% for GPT-5.1.
This is not an incremental advance. It is a step change. This indicates a new discovery, not just more data or more compute.
To succeed this well in math, you can't just do better probabilistic generation, you need verifiable search.
You need to verify what you're doing, detect when you make a mistake, and backtrack to try a different approach.
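To make the verify / detect / backtrack loop concrete, here is a minimal Python sketch on a toy problem (6-queens). It says nothing about how Gemini 3 works internally; the names `backtracking_search` and `queens_ok` are made up for illustration.

```python
from typing import Callable, List, Optional, Sequence


def backtracking_search(
    choices: Sequence[int],
    length: int,
    is_valid: Callable[[List[int]], bool],
) -> Optional[List[int]]:
    """Depth-first search that verifies every partial solution and backtracks on failure."""
    partial: List[int] = []

    def extend() -> bool:
        if len(partial) == length:
            return True  # every step was verified on the way down
        for choice in choices:
            partial.append(choice)  # propose a next step
            if is_valid(partial) and extend():  # verify it, then go deeper
                return True
            partial.pop()  # mistake detected (here or deeper): backtrack
        return False

    return partial if extend() else None


def queens_ok(cols: List[int]) -> bool:
    """Verifier: the most recently placed queen attacks no earlier queen."""
    row = len(cols) - 1
    return all(
        cols[r] != cols[row] and abs(cols[r] - cols[row]) != row - r
        for r in range(row)
    )


# cols[i] is the column of the queen placed in row i.
solution = backtracking_search(range(6), 6, queens_ok)
print(solution)
```

The point of the toy is only that the verifier runs after every proposed step, so a bad step is discarded right away instead of being built upon.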
The SimpleQA benchmark is another datapoint that we're probably looking at a research breakthrough, not just more data or more compute. Gemini 3 Pro achieved more than double the reliability of GPT-5.1 (72.1% vs. 34.9%).
This isn't an incremental gain, it's a step-change leap in reducing hallucinations.
And it's exactly what you'd expect to see if there's an underlying shift from probabilistic token prediction to verified search, with better error detection and backtracking when it finds an error.
That could explain the breakout performance on math, and reliability, and even operating graphical user interfaces (ScreenSpot-Pro at 72.7% vs. 3.5% for GPT-5.1).
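For anyone who wants to check the "more than double" style claims, the arithmetic on the scores quoted above can be sketched like this. The numbers are the ones from this comment; pairing Gemini 3 with Claude Sonnet 4.5 on MathArena Apex (the best of the other scores listed) is my own choice.

```python
# (Gemini 3 score, comparison score), both in percent, as quoted in this comment.
scores = {
    "MathArena Apex": (23.4, 1.6),   # vs. Claude Sonnet 4.5
    "SimpleQA": (72.1, 34.9),        # vs. GPT-5.1
    "ScreenSpot-Pro": (72.7, 3.5),   # vs. GPT-5.1
}

# Relative improvement (ratio) and absolute gap (percentage points).
ratios = {name: gemini / other for name, (gemini, other) in scores.items()}
gaps = {name: gemini - other for name, (gemini, other) in scores.items()}

for name in scores:
    print(f"{name}: {ratios[name]:.1f}x, +{gaps[name]:.1f} points")
```

The ratios look most dramatic where the comparison score is close to zero, so the absolute gap is worth reading alongside them.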
I usually ask a simple question that ALL the models get wrong: a list of the mayors of my city [Londrina]. ALL the models (offline) get it wrong. And I mean all the models. The best I could get was from o3, I believe, which said it couldn't give a good answer and told me to check the city website.
Gemini 3 somehow is able to give a list of mayors, including details on who got impeached, etc.
This should be a simple answer, because all the data is on Wikipedia, which the models are certainly trained on, but somehow most models don't manage to get that answer right, because... it's just an irrelevant city in a huge dataset.
But somehow, Gemini 3 did it.
Edit: Just asked "Cool places to visit in Londrina" (in Portuguese), and it was also 99% right, unlike other models, which just make stuff up. The only thing wrong: it mentioned sakuras at a lake... Maybe it confused them with Brazilian ipês, which are similar, and the city is indeed full of them.
Ha, I just did the same with my hometown (Guaiba, RS), a city 1/6th the size of Londrina, whose English Wikipedia page hasn't been updated in years and still has the wrong mayor (!).
Gemini 3 nailed it on the first try, included political affiliation, and added some context on who they competed with and won over in each of the last 3 elections. And I just built a fun application with AI Studio, and it worked on the first shot. Pretty impressive.
(disclaimer: Googler, but no affiliation with Gemini team)
Pure fact-based, niche questions like that aren't really the focus of most providers any more from what I've heard, since they can be solved more reliably by integrating search tools (and all providers now have search).
I wouldn't be surprised if the smallest models can answer fewer such (fact-only) questions over time offline as they distill/focus them more thoroughly on logic etc.
Funny, I just asked "Ask Brave", which uses a cheap LLM connected directly to its search engine, and it got it right without any issues.
It shows once again that for common searches, (indexed) data is king, and that's where I expect that even a simple LLM directly connected to a huge indexed dataset would win against much more sophisticated LLMs that have to use agents for searching.
The one thing I got out of the MIT OpenCourseWare AI course by Patrick Winston was that all of AI could be framed as a problem of search. Interesting to see Demis echo that here.
It tells me that the benchmark is probably leaking into training data. From the benchmark site:
> Model was published after the competition date, making contamination possible.
Aside from eval on most of these benchmarks being stupid most of the time, these guys have every incentive to cheat - these aren't some academic AI labs, they have to justify hundreds of billions being spent/allocated in the market.
Actually trying the model on a few of my daily tasks and reading the reasoning traces all I'm seeing is same old, same old - Claude is still better at "getting" the problem.
> To succeed this well in math, you can't just do better probabilistic generation, you need verifiable search.
You say "probabilistic generation" like it's some kind of a limitation. What exactly is the limiting factor here? [(0.9999, "4"), (0.00001, "four"), ...] is a valid probability distribution. The sampler can be set to always choose "4" in such cases.
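A minimal Python sketch of that point, assuming a toy next-token distribution: the first two entries are the ones above, the third ("5") is made up so the probabilities sum to 1, and the helper names are hypothetical. With greedy decoding (temperature 0) the output is deterministic even though the model itself is probabilistic.

```python
import random
from typing import List, Tuple

Distribution = List[Tuple[float, str]]

# Toy next-token distribution; the "5" entry is an assumption for illustration.
dist: Distribution = [(0.9999, "4"), (0.00001, "four"), (0.00009, "5")]


def greedy_pick(distribution: Distribution) -> str:
    """Always return the most probable token (deterministic)."""
    return max(distribution, key=lambda pair: pair[0])[1]


def sample_with_temperature(
    distribution: Distribution, temperature: float, rng: random.Random
) -> str:
    """Sample a token; temperature 0 falls back to greedy decoding."""
    if temperature == 0:
        return greedy_pick(distribution)
    # p ** (1 / T) matches softmax(logits / T) up to normalization,
    # and random.choices does not need normalized weights.
    weights = [p ** (1.0 / temperature) for p, _ in distribution]
    tokens = [token for _, token in distribution]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)
print(greedy_pick(dist))
print(sample_with_temperature(dist, 0.0, rng))
print([sample_with_temperature(dist, 1.0, rng) for _ in range(5)])
```

At temperature 1.0 the rare tokens can still show up occasionally; lowering the temperature pushes the weights further toward "4".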
I'll give you the style is like an LLM but the thoughts seem a bit unlike one. I mean the MathArena Apex results indicating a new discovery rather than more data is definitely a hypothesis.
From my understanding, Google put online the largest RL cluster in the world not so long ago. It's not surprising they do really well on things that are "easy" to RL, like math or SimpleQA
I’ll take you at your word, sorry for the incorrect callout. Your comment format appeared malicious, so my response wasn’t an attempt at being “snarky”, just acting defensively. I like the HN Rules/Guidelines.
You mentioned "step change" twice. Maybe a once over next time? My favorite Mark Twain quote is (very paraphrased) "My apologies, had I more time, I would have written a shorter letter".
This is something that is happening to me too, and frankly I'm a little concerned. English is not my first language, so I use AI for checking and writing many things. And I spend a lot of time with coding tools. And now I sometimes need to make a conscious effort to avoid mimicking some LLM patterns...
You seem very comfortable making unfounded claims. I don't think this is very constructive or adds much to the discussion. While we can debate the stylistic changes of the previous commenter, you seem to be discounting the rate at which the writing style of various LLMs has backpropagated into many people's brains.
I can sympathize with being mistakenly accused of using LLM output, but as a reader the above format of "It's not x - it's y" repeated multiple times for artificial dramatic emphasis to make a pretty mundane point that could use 1/3 the length grates on me like reading LinkedIn or marketing voice whether it's AI or not (and it's almost always AI anyway).
I've seen fairly niche subreddits go from enjoyable and interesting to ruined by being clogged with LLM spam that sounds exactly like this so my tolerance for reading it is incredibly low, especially on HN, and I'll just dismiss it.
I probably lose the occasional legitimate original observation now and then, but in a world where our attention is being hijacked by zero-effort spam everywhere you look I just don't have the time or energy to avoid that heuristic.
Also discounting the fact that people actually do talk like that. In fact, these days I have to modify my prose to be intentionally less LLM-like lest the reader think it's LLM output.
1) Models learn these patterns from common human usage. They are in the wild, and as such there will be people who use them naturally.
2) Now, given its for-some-reason-ubiquitous choice by models, it is also a phrasing that many more people are exposed to, every day.
Language is contagious. This phrasing is approaching herd levels, meaning models trained from up-to-the-moment web content will start to see it as less distinctly salient. Eventually, there will be some other high-signal novel phrase with high salience, and the attention heads will latch on to it from the surrounding context, and then that will be the new AI shibboleth.
It's just how language works. We see it in the mixes between generations when our kids pick up new lingo, and then it stops being in-group for them when it spreads too far. Skibidi, 6 7, etc.
It's just how language works, and a generation ago the internet put it on steroids. Now? Even faster.