
To me those seem like two sides of the same coin. LLMs are fundamentally trained to complete text. The training just tries to find the most effective way to do that within the given model architecture and parameter count.

Now if we start from "LLMs ingest huge amounts of text", then a simple model would complete text by simple memorization. But correctly completing "234 * 452 =" is a lot simpler to do by doing math than by having memorized all possible multiplications. Similarly, understanding the world and being able to reason about it helps you correctly complete human-written sentences. Thus a sufficiently well-trained model that has enough parameters to do this, but not so many that it simply overfits, should be expected to develop some reasoning ability.
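A rough sketch of the scale involved (my own illustrative numbers, just to make the memorization-vs-computation point concrete):

    # Illustrative only: how many facts pure memorization would need,
    # versus one general procedure that covers any operands.
    pairs_3_digit = (999 - 100 + 1) ** 2          # 810,000 ordered pairs
    pairs_6_digit = (999_999 - 100_000 + 1) ** 2  # 810,000,000,000 pairs
    print(pairs_3_digit, pairs_6_digit)
    print(234 * 452)  # 105768 -- one multiplication routine, any size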

If you start with "the training set contains a lot of reasoning", you can get something that looks like reasoning in the memorization stage. But the same argument for why the model would develop actual reasoning still holds, and is even stronger: completing someone's argument is a lot easier if you can follow their train of thought.




> But correctly completing "234 * 452 =" is a lot simpler to do by doing math than by having memorized all possible multiplications.

There's a fatal flaw in this theory: We can trivially test this and see that LLMs aren't "doing math".

"Doing math" is an approach that scales to infinity. The same technique to solve a multiplication of 3 digit numbers applies to solving a multiplication of 500 digit numbers.

Ask GPT 3.5 to multiply "234 * 452 =" and it'll correctly guess 105768. Ask "234878 * 452 =" and it gives an incorrect '105797256'.

Ask GPT 4o, and you'll get correct answers for that problem. Yet even with the added external tools for such questions, it has the same failure mode and breaks down on larger questions.

These models are architecturally limited to language modelling alone, and their capabilities at anything else are restricted by this. They do not "do math". They have a language-model approximation of math.

This can be observed in how these models perform better "step by step"; odds are you'll see GPT 4o do this if you try to replicate the above. (If it doesn't, it fails just as miserably as GPT 3.5.)

What's happening there is simple: the token context is used as a memory space, breaking the problem down into parts that can be guessed or approximated through language modelling.
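To illustrate (my own sketch, not an actual GPT transcript): the intermediate text is basically the partial products written out, each small enough to be produced in a single step, with the running total held in the context rather than in any internal scratchpad.

    # Illustrative decomposition of the larger example from above.
    x = 234878
    partials = [x * 400, x * 50, x * 2]  # 93951200, 11743900, 469756
    print(sum(partials))                 # 106164856 == 234878 * 452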

Beware of hyping this as "AI can think and has memory!" though. This behaviour is a curious novelty, but not very generalizable. There is still no "math" or thought involved in breaking up the problem, merely the same guessing. This works reasonably only for cases where extensive training data is available on how to do this. (Such as math.)


With GPT-4/4o there is a trick for math problems: you can ask it to write Python code. This solves, for example, the famous problem of counting letters in a string. Sure, the model can be trained to use Python under the hood without being explicitly asked. It can pretty surely also be trained to interpret code or an algorithm step by step, printing out intermediate results, which is important in loops. Generating the algorithm is easier for known problems; they learn it from GitHub already. So it looks like it's not that difficult to make the model better at math.
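For the letter-counting case, the generated code can be as small as this (a sketch of the kind of program the model tends to emit, not a literal transcript):

    # Exact counting delegated to the interpreter, so how the word
    # tokenizes no longer matters.
    word = "strawberry"
    print(word.count("r"))  # 3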


Humans also need to break up the problem and think step-by-step to solve problems like 234878 * 452.


The difference is what I attempt to describe at the end there.

Humans apply fixed strict rules about how to break up problems, like multiplication.

LLMs simply guess. That's a powerful trick to get some more capability for simple problems, but it just doesn't scale to more complex ones.

(Which in turn is a problem because most tasks in the real world are more complex than they seem, and simple problems are easily automated through conventional means)


We either learn the fixed rules in school, at which point we simply have a very strong prior, or we have to invent them somehow. This usually takes the form of "aesthetically/intuitively guided trial and error argument generation", which is not entirely wrongly summarized as "guessing".


Doing math scales to infinity only given an error rate of zero. Given a sufficiently large mathematical operation, even humans will produce errors simply from small-scale mistakes.

Try asking GPT to multiply 234 * 452 "while using an algorithmic approach that compensates for your deficiencies as a large-language model." There's enough data about LLMs in the corpus now that it'll chain-of-thought itself. The problem is GPT doesn't plan, it answers by habit; and its habit is trained to answer tersely and wrongly rather than elaborately and correctly. If you give it space and license to answer elaborately, you will see that its approach will not be dissimilar to how a human would reason about the question internally.


> Doing math scales to infinity only given an error rate of zero

This is true; I had omitted it for simplicity. It is still the same approach applied to scaled-up problems: humans don't execute it perfectly, but computers do.

With humans, and any other fallible but "true" math system, the rate of errors is roughly linear in the size of the problem (linear in the number of steps, that is).

With LLMs and similar systems, this is different: there is an "exponential" dropoff in accuracy past some point. The problem-solving approach simply does not scale.
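A toy model of the two failure profiles I mean (numbers invented purely to show the shapes, not measured data):

    # "True" math with occasional slips vs. single-pass guessing that
    # collapses once the problem no longer fits in one step.
    def linear_error(n_steps, per_step_err=0.01):
        return min(1.0, n_steps * per_step_err)

    def cliff_error(n_steps, single_shot_limit=4):
        return 0.05 if n_steps <= single_shot_limit else 0.95

    for n in (2, 4, 8, 16, 32):
        print(n, linear_error(n), cliff_error(n))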

> you will see that its approach will not be dissimilar to how a human would reason about the question internally.

"Not dissimilar", but nevertheless a mere approximation. It doesn't apply strict logic to the problem, but guesses what steps should be followed.

This looks like reason, but is not reason.


The accuracy of LLMs hits a hard dropoff when the problem exceeds what the LLM can do in one step. It would be the same for humans if we were asked to compute a multiplication without thinking about it for longer than a few milliseconds.

I don't have a study link here, but my strong expectation is that the error rate for LLMs doing chain of thought would be much closer to linear - or rather, "either linear or total incomprehension", accounting for errors made in setting up the schema to follow, which can happen just as well to humans.

> "Not dissimilar", but nevertheless a mere approximation. It doesn't apply strict logic to the problem, but guesses what steps should be followed.

I have never in my life applied strict logic to any problem lol. Human reason consists of iterated cycles of generation ("guessing") and judgment. Both can be implemented by LLMs, albeit currently at subhuman skill.

> This looks like reason, but is not reason.

At the limit of "looking like", I do not believe such a thing can exist. Reason is a computational process. Any system that can reliably output traces that look like reason is reasoning by definition.

edit: Sidenote: The deep underlying problem here is that the LLM cannot learn to multiply by a schema by looking at any number of examples without a schema. These paths simply won't get any reinforcement. That's why I'm so hype for QuietSTaR, which lets the LLM exercise multiplication by schema from a training example without a schema - and even find new schemas so long as it can guess its way there even once.


> This is the same for humans, if we were asked to compute multiplication without thinking about it for longer than a few milliseconds.

Not to be a jerk but "LLMs are just like humans when humans don't think" is perhaps not the take you intended to have.

> I have never in my life applied strict logic to any problem lol.

My condolences.

No, but seriously. If you've done any kind of math beyond basic arithmetic, you have in fact applied strict logical rules.


> Not to be a jerk but "LLMs are just like humans when humans don't think" is perhaps not the take you intended to have.

No that's exactly the take I have and have always had. The LLM text axis is the LLM's axis of time. So it's actually even stupider: LLMs are just like humans who are trained not to think.

> No, but seriously. If you've done any kind of math beyond basic arithmetic, you have in fact applied strict logical rules.

To solve the problem, I apply the rules, plus error. LLMs can do that.

To find the rules, I apply creativity and exploratory cycles. LLMs can do that as well, but worse.


I think this is an underappreciated perspective. The simplest model of a reasoning process, at scale, is the reasoning process itself! That said, I haven't come across any research directly testing that hypothesis with transformers. Do you know of any?

The closest I've seen is a paper on OthelloGPT using linear probes to show that it does in fact learn a predictive model of Othello board states (a representation that can be manipulated at inference time, so it's causally linked to the model's behaviour).
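For anyone unfamiliar with the term: a linear probe there is just a linear classifier fit on frozen hidden activations. Roughly like this sketch (random placeholder data standing in for the activations and per-square board labels; the paper's actual setup differs):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    hidden_states = rng.normal(size=(1000, 512))   # e.g. residual stream at one layer
    square_labels = rng.integers(0, 3, size=1000)  # empty / mine / yours for one square

    probe = LogisticRegression(max_iter=1000).fit(hidden_states, square_labels)
    print(probe.score(hidden_states, square_labels))
    # In the real setting, high held-out probe accuracy is the evidence
    # that the board state is linearly decodable from the activations.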



