Good points. I suspect that o3 is able to reason more deeply about different paths through a codebase than earlier models, though, which might make it better at this kind of work in particular.
I was blown away by some debugging results I got from o3 early on and have been using it heavily since. The early results that caught my attention came from a couple of cases where it tracked down a problematic cause through several indirect layers of effects, the kind of thing you'd typically be tediously tracing step-by-step through a debugger. I think whatever's behind this capability overlaps with the really solid work it'll do in abstract system design, particularly when you have it think through the distant implications of design choices.
The main trick is in how you build up its context for the problem. What I do is think of it like a colleague I'm trying to explain the bug to: the overall structure is conversational, but I interleave relevant source chunks with detailed, complete observations of the anomalous program behavior. I'll typically send a first message building up context about the program/source, then build up the narrative context for the particular bug in a second message. This sets it up with basically perfect context to infer the problem, and sets you up for easy reuse: you can back up, clear that second message, and ask something else, reusing the detailed program context given by the first message.
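To make that concrete, here's roughly what the two-message setup looks like through the API. This is a minimal sketch assuming the OpenAI Python SDK's chat-completions interface; the model string, source layout, and bug story are placeholders I made up for illustration.

    from openai import OpenAI

    client = OpenAI()

    # Message 1: durable context about the program and the relevant source,
    # reusable across many questions.
    program_context = (
        "Here's the system: a job scheduler that leases tasks to a worker pool.\n"
        "--- scheduler.py ---\n"
        "<paste the relevant chunk here>\n"
        "--- worker_pool.py ---\n"
        "<paste the relevant chunk here>\n"
    )

    # Message 2: the narrative for this particular bug, with concrete observations.
    bug_narrative = (
        "Under load, a task occasionally runs twice. It only happens after a\n"
        "worker restart, and the duplicate always carries the old worker's\n"
        "lease id. What do you think is going on?"
    )

    messages = [
        {"role": "user", "content": program_context},
        {"role": "user", "content": bug_narrative},
    ]

    reply = client.chat.completions.create(model="o3", messages=messages)
    print(reply.choices[0].message.content)

    # To reuse the setup: keep messages[0], swap messages[1] for a different
    # question, and send again.

The point is that the first message is the expensive, carefully assembled part; everything after it is cheap to vary.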
Using it on the architectural side, you can follow a similar procedure, but instead of describing a bug you're describing the architectural revisions you've gone through, what your experience with each was, what your objectives for a potential refactor are, where your thinking is on candidate reformulations, and so on. Then finish with a question that doesn't overly constrain the model; you might retry from that conversation/context point with a few variants, e.g.: "what are your thoughts on all this?" or "can you think of better primitives to express the system through?"
I think there are two key points to doing this effectively:
1) Give it full, detailed context with nothing superfluous, and express it within the narrative of your real world situation.
2) Be careful not to "over-prescribe" what it says back to you. These models are very "genie-like": they'll often give you exactly what you ask for in a rather literal sense, sometimes in incredibly dumb-seeming ways if you're not careful.
In the context of LLMs, what do you mean by "reason"? What does reasoning look like in LLMs, how do you recognize it, and, more importantly, how do you invoke it? I haven't had much success in getting LLMs to solve, well, basically any problem that involves logic.
Chain of thought at least introduces some skepticism, but that's not exactly reasoning. It makes me wonder what people refer to when they say "reason".
As best I understand it, an LLM's output is directly determined by the state of the network given the context. Thinking is the use of intermediate predictions to help steer the network toward what is expected to be a better result, via learned patterns. Reasoning is a set of strategies for shaping that process to produce even more accurate output, generally having a cumulative effect on the accuracy of predictions.
It doesn’t? Reasoning is not an analysis; it is the application of learned patterns for a given set of parameters that results in higher accuracy.
Permit my likely inaccurate illustration:
You’re pretty sure 2 + 2 is 4, but there are several questions you could ask: are any of the numbers negative, are they decimals, were any numbers left out? Most of those questions are things you’ve learned to ask automatically, without thinking about it, because you know they’re important. But because the answer matters, you check your work by writing out the equation. Then, maybe you verify it with more math; 4 ÷ 2 = 2. Now you’re more confident the answer is right.
An LLM doesn’t understand math per se. If you type “2 + 2 =”, the model isn’t doing math… it’s predicting that “4” is the next most likely token based on patterns in its training data.
“Thinking” in an LLM is like the model shifting mode and starting to generate a list of question-and-answer pairs. These are again the next most likely tokens given the whole context so far. “Reasoning” sits above that: a controlling pattern that steers those question-and-answer sequences, injecting logic to help guide the model toward a hopefully more correct next token.
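A toy illustration of that distinction (nothing here calls a real model; the intermediate lines are invented):

    # Illustrative only: shows how generated "thinking" tokens become part of
    # the context that conditions the next prediction. No model is called.

    direct_context = "2 + 2 ="
    # The next token would be predicted from "2 + 2 =" alone.

    thinking_context = (
        "2 + 2 =\n"
        "Are either of the numbers negative? No.\n"
        "Are they decimals? No, both are the integer 2.\n"
        "So 2 + 2 = 4. Check: 4 / 2 = 2, which matches.\n"
        "Therefore the answer is"
    )
    # Each question/answer line is itself generated token by token and appended,
    # so the final prediction is conditioned on all the intermediate steps.

    for name, ctx in [("direct", direct_context), ("with thinking", thinking_context)]:
        print("--- " + name + " ---")
        print(ctx)
        print()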
Very likely. Larger context is significantly beneficial to LLMs when they can maintain attention over it, which was part of my point. Imagine being able to hold the word-for-word text of your required reading book while you're taking a test, whereas older models, from just two years ago, could hold more like a couple of chapters' worth of text.