
My takeaway - from this article, from Google’s AlphaEvolve [1], and the recent announcement about o3 finding a zero day in the Linux kernel [2] - is that Gemini Pro 2.5 and o3 in particular have reached a new level of capability, where ideas that were tried unsuccessfully with other models suddenly just work.

[1] https://deepmind.google/discover/blog/alphaevolve-a-gemini-p...

[2] https://sean.heelan.io/2025/05/22/how-i-used-o3-to-find-cve-...



In my opinion, I wouldn’t say so much that they are suddenly working. Rather, we’ve reached a point where they can iterate and test significantly faster than humans can, and can call on significantly more immediately available information that they can actually make sense of. As a result, the combination of information, advancement, and intelligently applied brute force seems to be having success in certain applications.


Good points. I suspect that o3 is able to reason more deeply about different paths through a codebase than earlier models, though, which might make it better at this kind of work in particular.


I was blown away by some debugging results I got from o3 early on and have been using it heavily since. The early results that caught my attention were from a couple cases where it tracked down some problematic cause through several indirect layers of effects in a way where you'd typically be tediously tracing step-by-step through a debugger. I think whatever's behind this capability has some overlap with really solid work it'll do in abstract system design, particularly in having it think through distant implications of design choices.


I’m interested in learning more about how you use o3 for debugging.


The main trick is in how you build up its context for the problem. What I do is think of it like a colleague I'm trying to explain the bug to: the overall structure is conversational, but I interleave both relevant source chunks and detailed, complete observations about the anomalous program behavior. I typically send a first message building up context about the program/source, and then build up the narrative context for the particular bug in a second message. This sets it up with basically perfect context to infer the problem, and sets you up for easy reuse: you can back up, clear that second message, and ask something else, reusing the detailed program context given by the first message.

Using it on the architectural side you can follow a similar procedure but instead of describing a bug you're describing architectural revisions you've gone through, what your experience with each was, what your objectives with a potential refactor are, where your thinking's at as far as candidate reformulations, and so on. Then finish with a question that doesn't overly constrain the model; you might retry from that conversation/context point with a few variants, e.g.: "what are your thoughts on all this?" or "can you think of better primitives to express the system through?"

I think there are two key points to doing this effectively:

1) Give it full, detailed context with nothing superfluous, and express it within the narrative of your real world situation.

2) Be careful not to "over-prescribe" what it says back to you. These models are very "genie-like": they'll often give you exactly what you ask for in a rather literal sense, in incredibly dumb-seeming ways if you're not careful.
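
To make the two-message flow concrete, here's a minimal sketch using the OpenAI Python client. The "o3" model name and the placeholder strings are just stand-ins; adapt to however you actually talk to the model:

    # Sketch of the two-message pattern: a reusable "program context" message,
    # plus a second message carrying the bug narrative (or any other question).
    from openai import OpenAI

    client = OpenAI()

    program_context = (
        "Here's the system we're working in, the relevant source chunks, "
        "and how the pieces fit together:\n<source + architecture notes>"
    )
    bug_narrative = (
        "Here's the anomalous behavior, step by step, with logs and repro notes:\n"
        "<observations>\n\nWhat's the most likely root cause?"
    )

    base = [{"role": "user", "content": program_context}]

    # First ask: the bug narrative rides on top of the shared program context.
    resp = client.chat.completions.create(model="o3", messages=base + [
        {"role": "user", "content": bug_narrative},
    ])
    print(resp.choices[0].message.content)

    # Reuse: drop the second message, keep the first, ask something else entirely.
    resp = client.chat.completions.create(model="o3", messages=base + [
        {"role": "user", "content": "Can you think of better primitives to express this system through?"},
    ])
    print(resp.choices[0].message.content)

The point isn't the API mechanics; it's that the expensive-to-write context lives in one message you never have to rebuild.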


In the context of LLMs, what do you mean by "reason"? What does reasoning look like in LLMs and how do you recognize it, and more importantly, how do you invoke it? I haven't had much success in getting LLMs to solve, well, basically any problem that involves logic.

Chain of thought at least introduces some skepticism, but that's not exactly reasoning. It makes me wonder what people refer to when they say "reason".


As best as I understand it, the LLM's output is directly related to the state of the network as a result of the context. Thinking is the way we use intermediate predictions to help steer the network toward what is expected to be a better result, through learned patterns. Reasoning is a set of strategies for shaping that process to produce even more accurate output, generally having a cumulative effect on the accuracy of predictions.


> Reasoning is a set of strategies for shaping that process to produce even more accurate output

How can it evaluate accuracy if it can't even detect contradictions reliably?


It doesn’t? Reasoning is not an analysis; it is the application of learned patterns for a given set of parameters that results in higher accuracy.

Permit my likely inaccurate illustration: You’re pretty sure 2 + 2 is 4, but there are several questions you could ask: are any of the numbers negative, are they decimals, were any numbers left out? Most of those questions are things you’ve learned to ask automatically, without thinking about it, because you know they’re important. But because the answer matters, you check your work by writing out the equation. Then, maybe you verify it with more math; 4 ÷ 2 = 2. Now you’re more confident the answer is right.

An LLM doesn’t understand math per se. If you type “2 + 2 =”, the model isn’t doing math… it’s predicting that “4” is the next most likely token based on patterns in its training data.

“Thinking” in an LLM is like the model shifting into a mode where it starts generating a list of question-and-answer pairs. These are again the next most likely tokens given the whole context so far. “Reasoning” sits above that: a controlling pattern that steers those question-and-answer sequences, injecting logic to help guide the model toward a hopefully more correct next token.
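
To make the “predicting, not calculating” point concrete, here's a toy illustration. It assumes the Hugging Face transformers library and the small gpt2 checkpoint; any causal LM would show the same thing:

    # "2 + 2 =" is completed by next-token prediction over training patterns, not by arithmetic.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tok("2 + 2 =", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]      # scores for the very next token
    probs = torch.softmax(logits, dim=-1)

    top = torch.topk(probs, 5)
    for p, i in zip(top.values, top.indices):
        print(f"{tok.decode(int(i))!r}: {float(p):.3f}")   # " 4" tends to rank highly

The “thinking” and “reasoning” layers are more of the same prediction, just arranged so the intermediate tokens steer the later ones.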


People think an approximation of a thing is the thing.


Very likely. Larger context is significantly beneficial to the LLMs when they can maintain attention over it, which was part of my point. Imagine being able to hold the word-for-word text of your required reading book in your head while you take a test, whereas the models of two years ago could only hold a couple of chapters' worth.


Gemini Pro 2.5 is the first AI that I can productively use for anything other than human language translation, but it's just barely crossed that threshold. Sometimes I get success hit rates below 20%.

When 3.0 comes out, that... that's going to start getting a little scary.


o3 is in my experience often even better, but too slow and too rate limited to use it all the time.


What domain?


SRE / DevOps / coding mostly in the Azure and .NET ecosystems.

The problems I have to solve tend to be the horrible ones that nobody has answers to, anywhere on the Internet, so unsurprisingly the AIs aren't good at it either.

The trick has been to use the AIs for what they are good at, which used to be "nothing" for me at least, but now I can use them productively for certain "spot" tasks.

Random examples:

- Cross-language and cross-platform benchmarking of a bunch of different database clients to see how they stack up. I gave the AI a working example in one language and got it to whip up a series of equivalents with other DB drivers and languages. Sure, it's trivial, but it's way faster than doing it myself!

- Crash dump analysis using WinDbg. I read somewhere that "vibe debugging" of kernel dumps totally works, so when I had an actual crash I gave it a go for laughs. With AI help I managed to extract the name of the specific file that had NTFS corruption and was crashing the server. Deleted the file, restored it from backups, and the server was good to go again!

- If you ever watch the top mechanical engineers on YouTube, they all make their own tools instead of just buying them. Jigs, extenders, unusual sizes, etc... IT work is the same. As a recent example, I got Gemini to make me a code-AST rewriter for a specific issue I wanted to clean up in bulk across a huge code base. Using the Roslyn compiler SDK is a bit fiddly, but it spat out a working tool for me in under an hour. (This is not something you can solve with a script full of regex, it needed a proper parser to handle commented-out blocks and the like.)


> Sure, it's trivial, but it's way faster than doing it myself

That's the clincher for me. So much software work is just executing on a design, not inventing anything new. Being able to do 5x the trivial work in an hour is life changing, and it lets me pull my head out of that work to see how I can make larger process improvements. AI doesn't need to rewrite the Linux kernel in Rust to be extremely valuable to the average developer.


Sounds like interesting work, thanks for sharing! "Vibe debugging", hah, I like that one. The latest crop of models is definitely unlocking new capabilities, and I totally get the desire to make your own tools. I do that to a fault sometimes, but it's nice to have a simple tool that does exactly one thing, exactly the way you want it.

I've been pair programming with the models for a while, and I wrote some "agents" before I knew to call them that, back in the dark days of GPT-3.5, but only recently have the latest models unlocked capabilities beyond what I could achieve with handwritten code.


It’s true that there are similarities between what you mentioned and what’s happening in this case. From the article:

> The result is a test-time loop that looks less like “chat with a compiler” in the case of sequential revision, and more like structured exploratory search, guided by explicit optimization hypotheses and aggressively parallel evaluation.

My conclusion would be that we’ve now learned to apply LLMs’ capabilities to shrink the solution space wherever we have a clear evaluation function and existing solutions to problems that follow similar patterns. That applies in this case as well.

IMO, it’s not about model X gaining on other models, or model Y being able to reason about the solutions, etc., in a way that other models couldn’t.
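
Roughly, the loop being described has this shape. The helper names are hypothetical stand-ins: propose_variants for an LLM call that rewrites a candidate kernel, and benchmark for the real evaluation function (e.g. measured runtime on a fixed workload):

    # Minimal sketch of "structured exploratory search": propose many candidates,
    # keep whatever the evaluation function likes, iterate. Helpers below are stubs.
    import concurrent.futures
    import random

    def propose_variants(code: str, n: int) -> list[str]:
        # Stand-in for an LLM call that returns n rewritten versions of the kernel.
        return [code + f"  // variant {random.random():.3f}" for _ in range(n)]

    def benchmark(code: str) -> float:
        # Stand-in for a real measurement, e.g. kernel wall-clock time; lower is better.
        return random.random()

    def search(seed: str, rounds: int = 5, beam: int = 4, width: int = 16) -> str:
        frontier = [seed]
        for _ in range(rounds):
            candidates = [v for c in frontier for v in propose_variants(c, width)]
            # Aggressively parallel evaluation of every candidate.
            with concurrent.futures.ThreadPoolExecutor() as pool:
                scores = list(pool.map(benchmark, candidates))
            ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0])
            frontier = [code for _, code in ranked[:beam]]   # keep the best few
        return frontier[0]

    print(search("__global__ void add(float* a, float* b, float* c) { /* ... */ }"))

None of this needs the model to be brilliant on any single shot; it needs a clear evaluation function and cheap retries.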


Interesting. Do you have stronger evidence to support your claim? A sample size of one is pretty unconvincing.


Wait, what are you saying? These have nothing to do with the Linux kernel whatsoever, they are "kernels" in the GPU programming sense. Did you just hallucinate this whole comment or what?


Sorry, I added links! Just a week ago someone built a system that used o3 to find novel zero days in the Linux kernel’s SMB implementation.


There are zero days in obscure parts of the kernel nobody uses every other day. (It also, of course, found 100 other things that were not zero days or vulnerabilities yet professed they were, which is why this trash, even on Gemini 9000 Pro, keeps spamming the security mailing lists.)


There was a post on HN a bit ago from someone who used o3 to find a vulnerability in the Linux kernel's SMB server, which this person is saying should've been tried earlier and probably only recently became possible.



