We are not going to prove or disprove "reasoning" through giving the LLM word problems. LLMs subvert the entire foundation of word problems, which is that words correlate to internal representations and are an indicator of thought processes. Word problems don't have construct validity for testing reasoning in LLMs.
On top of this, it is almost certain that OpenAI has teams of contractors reading as many conversations as possible and hand-fixing bad responses, which makes reproducibility a difficult concept when the object of inquiry can change from moment to moment.
What the field needs is not more people thinking up word problems but rigorous analysis of the internal behavior of these models and, maybe more importantly, a functional definition of terms like "reasoning" that everyone can agree on.
Or you could prove it does not reason by adversarially generating correct simple logic puzzles in the same class, with known answers.
Rephrasing the sentence structure, swapping in synonyms, or slightly modifying the initial conditions should not produce invalid reasoning explanations or wrong results.
Essentially a sufficiently complex, text-based form of sudoku.
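A minimal sketch of what that could look like, in Python: generate many paraphrased variants of the same transitive-ordering puzzle whose answer is known by construction, so only the surface form changes between variants. The names, templates, and the `make_puzzle` helper below are all hypothetical, just to illustrate the perturbation idea, not any particular benchmark.

```python
import random

NAMES = ["Alice", "Bob", "Carol", "Dave", "Erin"]

# Interchangeable phrasings of the same "X is taller than Y" fact.
FACT_TEMPLATES = [
    "{a} is taller than {b}.",
    "{b} is shorter than {a}.",
    "Compared with {b}, {a} is the taller one.",
]

QUESTION_TEMPLATES = [
    "Who is the tallest?",
    "Of everyone mentioned, who stands tallest?",
]

def make_puzzle(n_people=4, seed=None):
    """Return (puzzle_text, correct_answer) for one random ordering puzzle."""
    rng = random.Random(seed)
    people = rng.sample(NAMES, n_people)
    rng.shuffle(people)          # hidden ground-truth order, tallest first
    tallest = people[0]

    # State each adjacent relation once, each with a randomly chosen phrasing.
    facts = []
    for a, b in zip(people, people[1:]):
        facts.append(rng.choice(FACT_TEMPLATES).format(a=a, b=b))
    rng.shuffle(facts)           # premise order should not affect the answer

    question = rng.choice(QUESTION_TEMPLATES)
    return " ".join(facts) + " " + question, tallest

if __name__ == "__main__":
    # A batch of surface-level variants of the same puzzle class; a model
    # that genuinely reasons should get all of them right.
    for i in range(3):
        text, answer = make_puzzle(seed=i)
        print(text, "->", answer)
```

Because the ground-truth ordering is fixed before the text is rendered, every variant has a known answer, and any drop in accuracy across paraphrases is attributable to the wording rather than the underlying logic.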