While I agree that we should be skeptical about the reasoning capabilities of LLMs, comparing them to chess programs misses the point. Chess programs were specifically created to play chess; that's all they could do. They couldn't generalize to other board games, even closely related ones like Shogi and Xiangqi, the Japanese and Chinese versions of chess. What's remarkable about LLMs is that they can do things they were never programmed to do, almost by accident.
Here's an example. I'm interested in obscure conlangs like Volapük. I can feed an LLM (which had no idea what Volapük was) an English-language grammar of Volapük, and suddenly it can translate to and from the language. That wouldn't work with a chess program: I couldn't hand it a Shogi rulebook and have it play that game.
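A rough sketch of what I mean, using the OpenAI Python SDK as one example (the model name and grammar file are placeholders; any chat-capable LLM with a long enough context window would do):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholder file: a plain-text English-language grammar of Volapük.
with open("volapuk_grammar.txt") as f:
    grammar = f.read()

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {
            "role": "system",
            "content": (
                "You are a translator for the constructed language Volapük. "
                "Rely only on the grammar reference below.\n\n" + grammar
            ),
        },
        {"role": "user", "content": "Translate into Volapük: The bird sings."},
    ],
)
print(response.choices[0].message.content)
```

The point isn't the specific API; it's that the only "programming" involved is pasting a reference document into the context.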
Apologies, I was a bit curt because this is a well-worn interaction pattern.
I don't mean anything by the following either, other than that the goalposts have moved:
- This doesn't say anything about generalization, nor does it claim to.
- The occurrences of the prefix "general*" refer to the question "Can fine-tuning with synthetic logical reasoning tasks improve the general abilities of LLMs?"
- This specific suggestion was accomplished publicly to some acclaim in September
- To wit, the benchmark the article is centered around hasn't been updated since September, because the preview of the large model accomplishing that blew it out of the water; the top score on "all" at the time was 33%, and it hit 71%: https://huggingface.co/spaces/allenai/ZebraLogic
- These aren't supposed to be easy; they're constraint satisfaction problems, which they point out are used on the LSAT (see the sketch after this list)
- The other major form of this argument is the Apple paper, which shows a 5-point drop, from 87% to 82%, on a home-cooked model
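For anyone who hasn't looked at the benchmark: each ZebraLogic puzzle is a constraint satisfaction problem, assigning attributes to houses so that every clue holds. Here's a minimal toy sketch in Python (the clues and attributes are invented for illustration; the real benchmark puzzles are much larger) just to show the kind of structure a model has to satisfy:

```python
from itertools import permutations

# Toy zebra-style puzzle: 3 houses, each with a unique color and a unique pet.
# Clues (invented for illustration):
#   1. The cat lives in the red house.
#   2. The green house is immediately to the left of the dog's house.
#   3. The bird does not live in the first house.
def solve():
    for colors in permutations(["red", "green", "blue"]):
        for pets in permutations(["cat", "dog", "bird"]):
            red, green = colors.index("red"), colors.index("green")
            cat, dog, bird = pets.index("cat"), pets.index("dog"), pets.index("bird")
            if cat != red:        # clue 1
                continue
            if dog != green + 1:  # clue 2
                continue
            if bird == 0:         # clue 3
                continue
            return list(zip(colors, pets))
    return None

print(solve())  # [('red', 'cat'), ('green', 'bird'), ('blue', 'dog')]
```

Brute force is fine at this size, but the number of candidate assignments grows factorially with the number of houses and attributes, which is why the larger puzzles are genuinely hard rather than trivially pattern-matchable.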
You should try to play chess yourself, and then tell me you think these things aren't intelligent.