It seems obvious to me that LLMs wouldn't be able to find examples of every single problem posed to them in training data. There wouldn't be enough examples for the factual lookup needed in an information-retrieval-style search. I can believe that they're doing some form of extrapolation to create novel solutions to posed problems.
It's interesting that this paper doesn't contradict the conclusions of the Apple LLM paper [0], where prompts were corrupted to force the LLM into making errors. I can also believe that LLMs can only make small deviations from existing example solutions when creating these novel solutions.
I hate that we're using the term "reasoning" for this solution-generation process. It's a term coined by LLM companies to evoke an almost emotional response and shape how we talk about this technology. However, it does appear that we are capable of instructing machines to follow a series of steps using natural language, with some degree of ambiguity. That in and of itself is a huge stride forward.

[0] https://machinelearning.apple.com/research/gsm-symbolic
I very much agree with the perspective that LLMs are not suited for “reasoning” in the sense of creative problem solving or application of logic. I think that the real potential in this domain is having them act as a sort of “compiler” layer that bridges the gap between natural language - which is imprecise - and formal languages (sql, prolog, python, lean, etc) that are more suited for solving these types of problems. And then maybe synthesizing the results / outputs of the formal language layer. Basically “agents”.
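Roughly the shape I have in mind, as a sketch only (call_llm is a purely hypothetical stand-in for whatever model API you'd actually use, and sqlite just stands in for the formal layer):

    import sqlite3

    def call_llm(prompt: str) -> str:
        """Hypothetical stand-in for a real model API call; not any specific SDK."""
        raise NotImplementedError

    def answer(question: str, db: sqlite3.Connection) -> str:
        # 1. "Compile" the imprecise natural-language question into a formal language.
        sql = call_llm(f"Translate this question into a single SQLite query: {question}")
        # 2. Let the formal layer do the actual logic/computation.
        rows = db.execute(sql).fetchall()
        # 3. Synthesize the formal output back into natural language.
        return call_llm(f"Question: {question}\nResult rows: {rows}\nAnswer in one sentence.")

In practice you'd want to validate the generated SQL before executing it, but that's the basic loop.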
That being said, I do think that LLMs are capable of “verbal reasoning” operations. I don’t have a good sense of the boundaries that distinguish the different kinds of reasoning - verbal, qualitative, quantitative. What comes to mind are the verbal sections of standardized tests.
> I think that the real potential in this domain is having them act as a sort of “compiler” layer that bridges the gap between natural language - which is imprecise - and formal languages (sql, prolog, python, lean, etc) that are more suited for solving these types of problems. And then maybe synthesizing the results / outputs of the formal language layer. Basically “agents”.
Well, if you do all that, would you say that the system as a whole has 'reasoned'? (I think ChatGPT can already call out to Python.)
> I can believe that they're doing some form of extrapolation to create novel solutions to posed problems
You can believe it, but what sort of evidence are you using for this belief?
Edit: Also, the abstract of the Apple paper hardly says "corruption" (implying something tricky); it says that they changed the initial numerical values.
> It's a term coined by LLM companies to evoke an almost emotional response and shape how we talk about this technology.
Anthropomorphizing computers has been happening since long before ChatGPT. No one thinks their computer is actually eating their homework when they say that to refer to the fact that their computer crashed and their document wasn't saved; it's just an easy way to refer to the thing it just did. Before LLMs, "the computer is thinking" wasn't an unuttered sentence. Math terms aren't well known to everybody, so if I say Claude is dot-producting an essay for me, or that I had ChatGPT dot-product that letter to my boss, no one knows what a dot product is; so even if that's a more technically accurate verb, who's gonna use it? So while AI companies haven't done anything to promote usage of different terms than "thinking" and "reasoning", it's also because those are the handiest terms. It "thinks" there are two R's in strawberries. It dot-products there are two R's in strawberries. It also matrix-multiplies, occasionally softmaxes, convolves. But most people aren't Terence Tao and don't have a feel for when something's softmaxing, because what does that even mean?
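To be fair, those verbs are literal; a toy sketch of the arithmetic in question (plain numpy, not tied to any particular model or layer):

    import numpy as np

    # A "query" vector and a few "key" vectors: toy stand-ins for token representations.
    query = np.array([0.2, 0.7, 0.1])
    keys = np.array([[0.9, 0.1, 0.0],
                     [0.3, 0.8, 0.2],
                     [0.1, 0.2, 0.9]])

    # "Dot-producting": similarity scores between the query and each key.
    scores = keys @ query

    # "Softmaxing": turn the scores into a probability distribution.
    weights = np.exp(scores) / np.exp(scores).sum()

    print(scores)   # raw similarities
    print(weights)  # sums to 1.0; this is the vocabulary the machine actually works in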
Totally, these companies are pushing to showcase their AI models as self-thinking, reasoning AI, when really they are just trained on a huge amount of data in dataset form, which they extrapolate from to find the right answer.
They still can't think outside the box of their datasets.