While I agree that we should be skeptical about the reasoning capabilities of LLMs, comparing them to chess programs misses the point. Chess programs were specifically created to play chess; that's all they could do. They couldn't generalize to other board games, even closely related ones like Shogi and Xiangqi, the Japanese and Chinese versions of chess. What's remarkable about LLMs is that they can do things they were never programmed to do, almost by accident.
Here's an example. I'm interested in obscure conlangs like Volapük. I can feed an LLM (which had no idea what Volapük was) an English-language grammar of Volapük, and suddenly it can translate to and from the language. That wouldn't work with a chess program: I couldn't hand it a Shogi rulebook and have it play that game.
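A rough sketch of what I mean, using the OpenAI Python SDK as one example (the model name and grammar file are placeholders; any chat-capable LLM with a long enough context window would do):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholder file: a plain-text English-language grammar of Volapük.
with open("volapuk_grammar.txt") as f:
    grammar = f.read()

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {
            "role": "system",
            "content": (
                "You are a translator for the constructed language Volapük. "
                "Rely only on the grammar reference below.\n\n" + grammar
            ),
        },
        {"role": "user", "content": "Translate into Volapük: The bird sings."},
    ],
)
print(response.choices[0].message.content)
```

The point isn't the specific API; it's that the only "programming" involved is pasting a reference document into the context.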
Apologies, I was a bit curt because this is a well-worn interaction pattern.
I don't mean anything by the following either, other than that the goalposts have moved:
- This doesn't say anything about generalization, nor does it claim to.
- The occurrences of the prefix "general*" refer to the question "Can fine-tuning with synthetic logical reasoning tasks improve the general abilities of LLMs?"
- This specific suggestion was accomplished publicly to some acclaim in September
- To wit, the benchmark the article is centered around hasn't been updated since September, because the preview of the large model accomplishing that blew it out of the water; the top score on "all" at the time was 33%, and it hit 71%: https://huggingface.co/spaces/allenai/ZebraLogic
- These aren't supposed to be easy; they're constraint satisfaction problems, which they point out are used on the LSAT (see the sketch after this list)
- The other major form of this argument is the Apple paper, which shows a 5-point drop, from 87% to 82%, on a home-cooked model
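For anyone who hasn't looked at the benchmark: each ZebraLogic puzzle is a constraint satisfaction problem, assigning attributes to houses so that every clue holds. Here's a minimal toy sketch in Python (the clues and attributes are invented for illustration; the real benchmark puzzles are much larger) just to show the kind of structure a model has to satisfy:

```python
from itertools import permutations

# Toy zebra-style puzzle: 3 houses, each with a unique color and a unique pet.
# Clues (invented for illustration):
#   1. The cat lives in the red house.
#   2. The green house is immediately to the left of the dog's house.
#   3. The bird does not live in the first house.
def solve():
    for colors in permutations(["red", "green", "blue"]):
        for pets in permutations(["cat", "dog", "bird"]):
            red, green = colors.index("red"), colors.index("green")
            cat, dog, bird = pets.index("cat"), pets.index("dog"), pets.index("bird")
            if cat != red:        # clue 1
                continue
            if dog != green + 1:  # clue 2
                continue
            if bird == 0:         # clue 3
                continue
            return list(zip(colors, pets))
    return None

print(solve())  # [('red', 'cat'), ('green', 'bird'), ('blue', 'dog')]
```

Brute force is fine at this size, but the number of candidate assignments grows factorially with the number of houses and attributes, which is why the larger puzzles are genuinely hard rather than trivially pattern-matchable.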
You should try to play chess yourself, and then tell me you think these things aren't intelligent.