That's remarkably dismissive. He addresses your argument right away in the paper, and in multiple places. Did you read it?
It's not about testing the ability to do arithmetic. It's testing the ability to plan a reasoning process / argument.
It's also not the sum of his paper, only one section.
The problem is not that GPT-4 can't do the math problems, it's that it can't reason out how it would even begin to approach a math problem -- it would be totally okay for it to get them wrong, if it was actually making an attempt and could show evidence of working through them.
Instead it just produces "answers" which are a statistical guess based on other things it has seen on the internet. It's true humans do this, too -- often as a first lazy approximation to a problem -- but the key difference is that a human can reason out, through interrogation and introspection, where they might have gone wrong. GPT-4 appears to be unable to do that.
And worse, my experience with these systems (and the paper's) is that during dialogue about errors, the quality of their answers rapidly degrades.
I'm probably as bad as a high school student at formal logic, too. But if you sit down with me with a problem and we talk about it, and I'm interested, it will become evident I am capable of reasoning through it, even if I make mistakes. That's not the case with GPT-4.
> Instead it just produces "answers" which are a statistical guess based on other things it has seen on the internet.
It boggles my mind that folks expect otherwise from a Machine Learning tool, no matter how advanced and stuffed with data it may be. Perhaps it's the same phenomenon that causes us humans to see faces in clouds, smiles on dogs, and Jesus' likeness on toast?
Somewhere I definitely read about how human psychology makes us prone to that sort of thing. Even as far back as Eliza, cognitive scientists were commenting on how our thinking can be fooled.
I think there's an ideological bias in our culture that pushes people to believe that intelligent or structured phenomena inevitably emerge organically and progressively from complex phenomena.
Teleological thinking -- a tendency to read purpose and cause into chaotic/natural events and entities -- riddles popular thinking, especially among people in our profession. Science fiction is especially full of it.
It's not restricted to this domain at all. IMHO a similar bias underlies thinking around economics and the magical hand of the free market economy.
It's also a bias evident in the way some people talk about nature, gardening, etc. E.g. permaculture / natural farming people show it all the time.
> I think there's an ideological bias in our culture that pushes people to believe that intelligent or structured phenomena inevitably emerge organically and progressively from complex phenomena.
All science points to this being the case, for us. I think the only ones opposed are those that believe in young earth creationism, and only some portion of those that believe in old earth creationism.
Performing arithmetic is one kind of reasoning process. But getting the answers right is not necessarily the same as performing that reasoning.
If you go on to read, what he's trying to test is the system's ability to even attempt to plan out a problem solving "route". Which it doesn't really do. If it could, it could defer to another system (fancy calculators or solvers) to do the work. But its lack of ability to reason means it can't even be made to do that.
(EDIT: I do think the paper would be stronger if he put the math and formal logic etc problems later. E.g. the problem he puts forward in 3.14, 3.15 etc is more immediately damning as it reflects the "kind" of daily life reasoning that people would expect these systems to be able to perform.)
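To make the "defer to another system" idea above concrete, here is a rough sketch of a plan-then-delegate pattern (my own illustration; the function names and the trivial eval-based solver are invented, and are not anything from the paper or from OpenAI's tooling):

    # Hypothetical plan-then-defer pattern (illustrative only; no real LLM
    # or solver API is being called here).
    def solver(expression: str) -> float:
        # Stand-in for the calculator / CAS the model could hand work to.
        return eval(expression, {"__builtins__": {}})

    def answer(plan: list[str]) -> float:
        # The "reasoning" part is producing the plan; each arithmetic step
        # is delegated to the solver rather than guessed at.
        result = None
        for step in plan:
            result = solver(step)
        return result

    # e.g. a plan the model might emit for "multiply 1381 by 1453":
    print(answer(["1381 * 1453"]))  # 2006593

The argument being made is that GPT-4 struggles with the planning half of this split, not merely the arithmetic half.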
A language model sees a pile of examples with digits and imitates those examples. A reasoning model sees the inner principle behind this pile, and instead of imitating examples, it uses the learnt principle to produce answers.
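A toy sketch of that distinction (illustrative only; the lookup table stands in for memorized training examples, and multiplication stands in for the learnt principle -- this is not a claim about how either kind of model is implemented):

    # Toy illustration of imitation vs. applying the underlying principle.
    seen_examples = {(12, 13): 156, (7, 8): 56}

    def imitate(a, b):
        # Can only echo answers it has literally seen; unseen pairs fail.
        return seen_examples.get((a, b))

    def apply_principle(a, b):
        # Uses the rule itself, so it generalizes to any pair.
        return a * b

    print(imitate(1381, 1453))          # None -- never seen this pair
    print(apply_principle(1381, 1453))  # 2006593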
How do you know this? What's an example of a "reasoning model"?
If the only example is the human mind, for all we know our reasoning capability and ability to discern principles could work much the same way, and it's just some more subtle differences that lead to the differences in capabilities. There are plenty of cases where it appears as though GPT has discerned the "inner principle" behind something to produce answers.
Language models aren't really optimized for imitation though, they're optimized to predict. One means of prediction, which models have found to be effective in many contexts (especially when short on training time/compute), is comparable to imitation.
But this isn't to say that language models are incapable of establishing "inner principles".
This paper is not even reproducible lol. It makes a nonsensical claim it can't even back with results. Look at multiple comments here actually trying them out.
This is absolutely dismissive of the claim that an advanced LLM is capable of "reasoning", i.e. the action of thinking about something in a logical, sensible way.
That is the sum of the paper. Further, the author even goes on to say that if they asked a human these questions, they would conclude the same:
> Of course, even sophisticated human reasoners make mistakes, just like trained singers can hit false notes. But if a human made these mistakes, the ones reported in this article, then I would conclude without any hesitation that they cannot reason. Even if they went on to list a large number of other examples demonstrating impeccable reasoning, I would suspect that other factors (such as rote memorization or cheating) were behind the performance discrepancy.
So the author admits their own biases, which are used to bolster the argument that, if reasoning appears to be lacking in an answer, the system or entity itself is absolutely incapable of any reasoning and something else must explain why it appears to be reasoning in the first place. That's a VERY convenient way of dismissing any evidence that counters the claim.
> The problem is not that GPT-4 can't do the math problems
The problem is that the system was not allowed, or provided a path, to answer the math problems in a language better suited to analytical questions: code. That the author "denied" the LLM the ability to write code is the issue here, not a limitation of the model itself. An analogy: if a user asks in English a question that requires Pali, the LLM would be "prevented" from answering in Pali unless the user said they could understand it. In the same vein, it doesn't make sense to output Python by default if the system is unsure whether the user understands Python or knows how to run it.
If you say "I understand Python. Select two random numbers between 1381 and 1453 and multiply them together, reporting the result." the LLM will be capable of answering this question by generating code to solve the problem. This is likely to work every single time a question of this type is asked, but it does require that the user run the code.
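For concreteness, a minimal sketch of the kind of Python such a prompt tends to elicit (my own illustration, not an actual GPT-4 transcript); run by the user, it answers the question directly:

    # Assumed shape of the generated code; exact output varies run to run.
    import random

    a = random.randint(1381, 1453)  # randint is inclusive of both bounds
    b = random.randint(1381, 1453)
    print(f"{a} * {b} = {a * b}")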
GPT-4 has the ability to do this with code interpreter, so the question becomes: why did OpenAI choose to make the user explicitly indicate that code can be written? The answer likely lies in the fact that not everyone can read or run Python, a programming language, and therefore it remains an OPTION for the user to choose first. By not allowing the LLM to show the answers to analytical questions in code, the author "blocks" the LLM's ability to show off reasoning. And by treating those failures as "proof" of non-reasoning, the author gets the result they want.
From a scientific standpoint, a good hypothesis about reasoning ability must be one that can be disproven. If an experiment is run on a hypothesis that is absolute ("this thing can't reason"), then the results are not scientific, but opinion.
What can I say, we've seen a piiiile of lazy dismissals of LLM work based on examples from arithmetic and string manipulation. They aren't novel or interesting.