Does anyone remember the "Mad Libs" games? You fill out a form with blanks for "verb", "noun", "adjective", etc., then on the next page you plug those words into a template to create a silly story. The results are funny because the words you provided were chosen without context - they are syntactically correct, but nonsense in context.
LLMs are like Mad Libs with a "contextual predictor": they produce syntactically correct output, and the contextual predictor limits the amount of nonsense because statistical correlations can generate meaningful output most of the time. But there is no "reasoning" occurring here - just syntactic templating and statistical auto-complete.
Yes, but it's a huge, almost unimaginably complicated auto-complete model. And it turns out that a lot of human reasoning is statistically predictable enough in writing that you can obtain reasoning-like behavior just by having a good auto-complete model.
You shouldn't trivialize how amazingly well it does work, or how surprising it is that it works, just because it doesn't work in all cases.
Literally the whole point of TFA is to explore how this phenomenon of something-like-reasoning arises out of a sufficiently huge autocomplete model.
> And it turns out that a lot of human reasoning is statistically predictable enough in writing that you can actually obtain reasoning-like behavior just by having a good auto-complete model.
I would disagree with this on a technicality that changes the conclusion. It's not that human reasoning is statistically predictable (though it may be); it's that all of the writing that has ever described human reasoning, across an unimaginable number of topics, is statistically summarizable. A good auto-complete model therefore does a good job of describing human reasoning that has already been described, at least combinatorially, across various sources.
We don't have direct access to anyone else's reasoning. We infer their reasoning by seeing/hearing it described, then we fill in the blanks with our own reasoning-to-description experiences. When we see a model that's great at mimicking descriptions of reasoning, it triggers the same inferences, and we conclude similar reasoning must be going on under the hood. It's like the ELIZA Effect on steroids.
It might be the case that neural networks could theoretically, eventually reproduce the same kind of thinking we experience. But I think it's highly unlikely it'd be a single neural network trained on language, especially given the myriad studies showing the logic and reasoning capabilities of humans that are distinct from language. It'd probably be a large number of separate models trained on different domains that come together. At that point though, there are several domains that would be much more efficiently represented with something other than a neural network model, such as the modeling of physics and mathematics with equations (just because we're able to learn them with neurons in our brains doesn't mean that's the most efficient way to learn or remember them).
While a "sufficiently huge autocomplete model" is impressive and can do many things related to language, I think it's inaccurate to claim they develop reasoning capabilities. I think of transformer-based neural networks as giant compression algorithms. They're super lossy compression algorithms with super high compression ratios, which allows them to take in more information than any other models we've developed. They work well, because they have the unique ability to determine the least relevant information to lose. The auto-complete part is then using the compressed information in the form of the trained model to decompress prompts with astounding capability. We do similar things in our brains, but again, it's not entirely tied to language; that's just one of many tools we use.
> We don't have direct access to anyone else's reasoning. We infer their reasoning by seeing/hearing it described, then we fill in the blanks with our own reasoning-to-description experiences. When we see a model that's great at mimicking descriptions of reasoning, it triggers the same inferences, and we conclude similar reasoning must be going on under the hood. It's like the ELIZA Effect on steroids.
I don't think we know enough about how these things work yet to conclude that they are definitely not "reasoning" in at least a limited subset of cases, in the broadest sense wherein ELIZA is also "reasoning" because it follows a sequence of logical steps to produce a conclusion.
Again, that's the point of TFA: something in the linear algebra stew does seem to produce reasoning-like behavior, and we want to learn more about it.
What is reasoning if not the ability to assess "if this" and conclude "then that"? If you can do it with logic gates, who's to say you can't do it with transformers or one of the newer SSMs? And who's to say it can't be learned from data?
In some sense, ELIZA was reasoning... but only within a very limited domain. And it couldn't learn anything new.
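To make the "limited domain" point concrete, here's a minimal sketch of an ELIZA-style rule engine in Python (hypothetical rules, not ELIZA's actual script). Every response is a hand-coded "if this pattern, then that template" step - a narrow, fixed form of if-then reasoning that never learns anything new:

    import re

    # ELIZA-style rules: "if this pattern, then that response template".
    # The "reasoning" is exactly what the author encoded, nothing more.
    RULES = [
        (r"I need (.*)",  "Why do you need {0}?"),
        (r"I am (.*)",    "How long have you been {0}?"),
        (r"(.*)",         "Please, go on."),  # catch-all fallback
    ]

    def respond(utterance):
        for pattern, template in RULES:
            match = re.match(pattern, utterance, re.IGNORECASE)
            if match:
                return template.format(*match.groups())

    print(respond("I need a vacation"))  # Why do you need a vacation?
    print(respond("I am tired"))         # How long have you been tired?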
> It might be the case that neural networks could theoretically, eventually reproduce the same kind of thinking we experience. But I think it's highly unlikely it'd be a single neural network trained on language, especially given the myriad studies showing the logic and reasoning capabilities of humans that are distinct from language. It'd probably be a large number of separate models trained on different domains that come together.
Right, I think we agree here. It seems like we're hitting the top of an S-curve when it comes to how much information the transformer architecture can extract from human-generated text. To progress further, we will need different inputs and different architectures / system designs, e.g. something that has multiple layers of short- and medium-term working memory, the ability to update and learn over time, etc.
My main point is that while yes, it's "just" super-autocomplete, we should consider it within the realm of possibility that some limited form of reasoning might actually be part of the emergent behavior of such an autocomplete system. This is not AGI, but it's both suggestive and tantalizing. It is far from trivial, and greatly exceeds what anyone expected should be possible just 2 years ago. If nothing else, I think it tells us that maybe we do not understand the nature of human rationality as well as we thought we did.
> What is reasoning if not the ability to assess "if this" and conclude "then that"?
A lot of things. There are entire fields of study which seek to define reasoning, breaking it down into areas that include logic and inference, problem solving, creative thinking, etc.
> If you can do it with logic gates, who's to say you can't do it with transformers or one of the newer SSMs? And who's to say it can't be learned from data?
I'm not saying you can't do it with transformers. But what's the basis of the belief that it can be done with a single transformer model, and one trained on language specifically?
More specifically, the papers I've read so far that investigate the reasoning capabilities of neural network models (not just LLMs) seem to indicate that they're capable of emergent reasoning about the rules governing their input data. For example, being able to reverse-engineer equations (and not just approximations of them) from input/output pairs. Extending these studies would indicate that large language models are able to emergently learn the rules governing language, not necessarily much beyond that.
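As a toy illustration of what "recovering the exact rule, not an approximation" means, here's a sketch that checks candidate expressions against input/output pairs. It uses brute-force enumeration over a tiny, hand-picked hypothesis space rather than a neural network, so it is only a stand-in for the emergent behavior those papers report, not their mechanism:

    # Input/output pairs secretly generated by y = x*x + x + 1.
    pairs = [(0, 1), (1, 3), (2, 7), (3, 13), (4, 21)]

    # Tiny, hand-picked hypothesis space (purely illustrative).
    candidates = {
        "x*x + x + 1": lambda x: x * x + x + 1,
        "2*x + 1":     lambda x: 2 * x + 1,
        "x**3 - 1":    lambda x: x ** 3 - 1,
    }

    for name, f in candidates.items():
        if all(f(x) == y for x, y in pairs):
            print("exact rule recovered:", name)  # x*x + x + 1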
It makes me think of two anecdotes:
1. How many times have you heard someone say, "I'm a visual learner"? They've figured out for themselves that language isn't necessarily the best way for them to learn concepts to inform their reasoning. Indeed, there are many concepts that language is inefficient, if not insufficient, at conveying. The world's shortest published research paper is proof of this: https://paperpile.com/blog/shortest-papers/.
2. When I studied in school, I noticed that for many subjects and tests, sufficient rote memorization became indistinguishable from actual understanding. Conversely, better understanding of underlying principles often reduced the need for rote memorization. Taken to the extreme, there are many domains for which sufficient memorization makes actual understanding and reasoning unnecessary.
Perhaps the debate on whether LLMs can reason is a red herring, given that their ability to memorize surpasses any human's by many orders of magnitude. Perhaps this is why they seem able to reason, especially given that our only indication so far is the language they output. The most useful use cases are typically those that trigger our own reasoning more efficiently, rather than relying on theirs (which may not exist).
I think the impressiveness of their capabilities is precisely what makes exaggeration unnecessary.
Saying LLMs develop emergent logic and reasoning, I think, is a stretch. Saying it's "within the realm of possibility that some limited form of reasoning might actually be part of the emergent behavior" sounds more realistic to me, though rightly less sensational.
EDIT:
I also think it's fair to say that the ELIZA program had whatever limited reasoning was programmed into it. However, the point of the ELIZA study was that it showed people's tendency to overestimate how much reasoning is happening, based on their own inferences. This is significant because it causes us to overestimate the generalizability of the program, which can lead to unintended consequences as reliance increases.
> But there is no "reasoning" occurring here - just syntactic templating and statistical auto-complete.
This is the "stochastic parrot" hypothesis, which people feel obligated to bring up every single time there's a LLM paper on HN.
This hypothesis isn't just philosophical: it leads to falsifiable predictions, and experiments have thoroughly falsified them - LLMs do have a world model. See OthelloGPT for the most famous paper on the subject, and "Transformers Represent Belief State Geometry in their Residual Stream" for a more recent one.
Well, we don't have an understanding of how the brain works, so we can't be fully sure, but it's clear why they have this intuition:
1) Many people have had to cram for some exam where they didn't have time to fully understand the material. So for those parts they memorized as much as they could and got through the exam by pattern matching. But they knew there was a difference, because they knew what it was like to fully understand something, where they could reason about it and play with it in their mind.
2) Crucially, if they understand the key mechanism early, then they often don't need to memorize anything (the opposite of LLMs, which need millions of examples).
3) LLMs display the attributes of someone who has crammed for an exam, and when they are probed further [1] they start to break down in exactly the same way a crammer does.
I understand why they intuitively think it isn't. I also think there is probably something more to reasoning. I'm just mystified by why they are so sure it isn't.
Logic is a syntactic formalism that humans often apply imperfectly. That certainly sounds like we could be employing syntactic templating and statistical auto-complete.
I was trying to tease apart whether you were talking about human behavior or the abstract concept of 'reasoning'. The latter is formalized in logic and has parts that are not merely syntactic (with or without stochastic autocomplete).
You seem to be confusing logic and proofs with any kind of random rhetoric or syntactically correct opinion which might, in terms of semantics, be total nonsense. If you really don't understand that there's a difference between these things, then there's probably no difference between anything else either, and since things that are indiscernible must be identical, I conclude that I must be you, and I declare myself wrong, thus you are wrong too. Are we enjoying this kind of "reasoning" yet, or do we perhaps want a more solid rock on which to build the church?
I don't know what claim you think I'm making that you inferred from my 5 sentences, but it's really simple: do you agree or disagree that humans make mistakes in logical deduction?
I certainly hope you agree, in which case it follows that a person's understanding of any proposition, inference or deduction is only probabilistic, with some certainty less than one. When they believe a mistaken deduction, or make one themselves, they are going through the motions of applying logic without actually understanding what they're doing, which I suppose you could whimsically call "hallucinating". A person will typically continue to repeat this mistaken deduction until someone corrects them.
So if our only example of "reasoning" shares many of the same properties and flaws as LLMs, albeit at a lower rate, and correcting this paragon of reasoning is basically what we also do with LLMs (have them review their own output or check it against another LLM), then this claim to human specialness starts to look a lot like special pleading.
I haven't made any claim that humans are special. And your claim, in your own words, is that if mistakes are made in logical deduction, the agent involved must ultimately be employing statistical auto-complete? No idea why you would think that, or what else you want to conclude from it, but it's obviously not true. Just consider an agent that inverts every truth value you try to put into the knowledge base and then proceeds as usual with anything you ask it to do. It makes mistakes and has nothing at all to do with probability; therefore some systems that make mistakes aren't LLMs. QED?
Ironically, the weird idea that "all broken systems must be broken in the same way", or even "all broken systems use equivalent mechanics", is exactly the type of thing you get by leaning on a language model that isn't even trying to understand the underlying logic.
> I haven't made any claim that humans are special
The whole context of this thread is that humans are "reasoning" and LLMs are just statistical syntax predictors, which is "lesser", i.e. humans are special.
> And your claim, in your own words, is that if mistakes are made in logical deduction, that means that the agent involved must ultimately be employing statistical auto-complete?
No, I said humans would be employing statistical auto-complete. The whole point of this argument is to show that the allegedly non-statistical, non-syntactic "reasoning" humans do - which supposedly makes them superior to the statistical, syntactic processing LLMs do - is mostly a fiction.
> leaning on a language model that really isn't even trying to understand the underlying logic.
You don't know that the LLM is not understanding. In fact, for certain rigorous formal definitions of "understanding", it absolutely does understand something. You can only reliably claim LLMs don't understand everything as well as some humans.
In no way does "Turing Completeness" imply the ability to reason - it's like arguing that a nightlight "reasons" about whether it is dark out or not.
However, if reasoning is computable, then a syntactic transformation can compute it. The point is that calling something a "mere" syntactic transformation does not imply computational weakness.
> A system that is Turing Complete absolutely can be programmed to reason, aka it has the ability to reason.
You can write a C program that can reason, but the C compiler itself can't reason. So the "program" part is missing between "Turing Completeness" and reasoning, and it is a very non-trivial part.
Given "reasoning" is still undefined, I would not go so far as to claim that a C compiler is not reasoning. What if a C compiler's semantic analysis pass is a limited form of reasoning?
Furthermore, the C compiler can do a lot more than you think. The P99/metalang99 macro toolkits give the preprocessor enough state space to encode and run an LLM, in principle.
I can define "reasoning": given a set of observations and a set of inference rules, infer new observations.
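A minimal sketch of that definition, assuming plain forward chaining over a set of facts (all names here are made up for illustration):

    # "Given observations and inference rules, infer new observations."
    facts = {"socrates_is_human"}
    rules = [
        # (premises, conclusion)
        ({"socrates_is_human"},  "socrates_is_mortal"),
        ({"socrates_is_mortal"}, "socrates_will_die"),
    ]

    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)  # a newly inferred observation
                changed = True

    print(facts)  # now includes the two derived observations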
> What if a C compiler's semantic analysis pass is a limited form of reasoning?
I guess you can say that a C compiler can reason in a specific narrow domain, because it is also a program and someone programmed it to reason in that domain.
I think the C compiler was the wrong analogy, because it is also a program. It would be more correct to refer to some machine that executes ASM/C/bytecode, etc. That machine (e.g. a CPU or VM) is Turing complete, but one needs to write a program to do the reasoning. A C compiler doing some semantic reasoning over, say, datatypes is an example of such a program.
The network has specific circuits that correspond to concepts and you can see that the network uses and combines those concepts to work through problems. That is reasoning.
Under this definition a 74LS21 AND gate is reasoning - it has specific circuits that correspond to concepts, and it uses that network to determine an output based on the input. That seems overly broad - we run back into the issue of saying that a nightlight or thermostat is reasoning.
For true reasoning you really need to introduce the ability for the circuit to intentionally decide to do something different that is not just a random selection or hallucination - otherwise we are just saying that state machines "reason" for the sake of using an anthropomorphic word.
This restriction makes it impossible to determine whether something is reasoning. An LLM may well intentionally make decisions; I have as much evidence for that as I have for anybody else doing so, i.e. zilch. I'm not even sure that I make intentional decisions; I can only say that it feels like I do. But free will isn't really compatible with my model of physical reality.
Of course logic gates apply logical reasoning to solve problems; they are not much use for anything else (except as a space heater if there are a lot of them).
"Reasoning" implies the extrapolation of information - not the mechanical generation of a fixed output based on known inputs. No one would claim that a set of gears is "reasoning" but the logic gate is as fixed in it's output as a transmission.