It's cheating to the extent that it misrepresents the strength and reasoning ability of the model, that is, to the extent that anyone looks at its chess-playing results and incorrectly infers that they say something about how good the model is.
The takeaway here is that if you are evaluating different models for your own use case, the only indication of how useful each may be is to test it on your actual use case, and ignore all benchmarks or anything else you may have heard about it.
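To make "test it on your actual use case" concrete, here is a minimal sketch of a comparison harness. Everything in it is an assumption for illustration: `model_a` and `model_b` are stand-in functions rather than real API calls, and the two test cases are placeholders for prompts and checks taken from your own workload.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], str]
Case = Tuple[str, Callable[[str], bool]]  # (prompt, check on the output)


def pass_rate(model: Model, cases: List[Case]) -> float:
    """Fraction of cases whose output satisfies its check."""
    if not cases:
        raise ValueError("need at least one test case")
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)


# Stand-in "models" so the sketch runs without any API; swap in real calls.
def model_a(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "I don't know"


def model_b(prompt: str) -> str:
    return "Paris" if "France" in prompt else "4"


# Placeholder cases; in practice these come from your own use case.
CASES: List[Case] = [
    ("What is 2 + 2?", lambda out: out.strip() == "4"),
    ("Capital of France?", lambda out: "Paris" in out),
]

scores = {
    name: pass_rate(model, CASES)
    for name, model in [("model_a", model_a), ("model_b", model_b)]
}
print(scores)
```

The point of the design is only that the cases and checks are yours, so the score reflects your task rather than someone else's benchmark.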
It represents the model's ability to correctly choose and use a tool, which seems more useful than a model that can play chess by itself but keeps playing chess when you need it to do something else.
Where it’ll surprise people is if they don’t realize it’s using an external tool and expect it to be able to find solutions of similar complexity to non-chess problems, or if they don’t realize this was probably a special case added to the program and that this doesn’t mean it’s, like, learned how to go find and use the right tool for a given problem in a general case.
I agree that this is a good way to enhance the utility of these things, though.
It doesn't take much to recognize a sequence of chess moves. A regex could do that.
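As a rough illustration of the regex claim, here is a sketch that checks whether a string looks like a list of moves in standard algebraic notation. The pattern and the name `looks_like_chess_moves` are my own assumptions, not anything a particular product is known to use, and the check is purely syntactic: it says nothing about whether the moves are legal.

```python
import re

# One move in standard algebraic notation: e4, Nf3, exd5, O-O, e8=Q+, ...
SAN_MOVE = r"(?:O-O(?:-O)?|[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](?:=[QRBN])?)[+#]?"

# A whole string of moves, each optionally preceded by a number like "1." or "1..."
MOVE_LIST = re.compile(rf"^\s*(?:(?:\d+\.(?:\.\.)?\s*)?{SAN_MOVE}\s*)+$")


def looks_like_chess_moves(text: str) -> bool:
    """True if the text is syntactically a sequence of chess moves."""
    return bool(MOVE_LIST.match(text))


print(looks_like_chess_moves("1. e4 e5 2. Nf3 Nc6 3. Bb5 a6"))
print(looks_like_chess_moves("What is the capital of France?"))
```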
If what you want is intelligence and reasoning, there is no tool for that; LLMs are as good as it gets for now.
At the end of the day it either works on your use case, or it doesn't. Perhaps it doesn't work out of the box but you can code an agent using tools and duct tape.
Do you really think it's feasible to maintain and execute a set of regexes for every known problem every time you need to reason about something? Welcome to the 1970s AI winter...
Sure, but how do you train a smarter model that can use tools, without first having a less smart model that can use tools? This is just part of the progress. I don't think anyone claims this is the endgame.
I really don't understand what point you are trying to make.
Your original comment about a model that might "keep playing chess" when you want it to do something else makes no sense. This isn't how LLMs work - they don't have a mind of their own, but rather just "go with the flow" and continue whatever prompt you give them.
Tool use is really no different from normal prompting. Tools are configured internally as part of the hidden system prompt. You're basically just telling the model to use a specific tool in specific circumstances, and since the model has been trained to follow instructions, it does so. This is just the model generating the most expected continuation, as normal.
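A minimal sketch of what that can look like, under stated assumptions: the tool name `chess_engine`, the JSON reply convention, and `fake_chess_engine` are all hypothetical and not taken from any real vendor's API. The tool is described in plain text in the system prompt, and ordinary code decides whether the model's reply is a tool call.

```python
import json

# Hypothetical tool registry; the name and schema are invented for illustration.
TOOLS = [
    {
        "name": "chess_engine",
        "description": "Return the best next move for a game given as a list of moves.",
        "parameters": {"moves": "string, e.g. '1. e4 e5 2. Nf3'"},
    }
]


def build_system_prompt(tools):
    """Describe the tools in plain text; this is all the model ever 'sees'."""
    lines = [
        "You can call a tool by replying with JSON of the form",
        '{"tool": <name>, "arguments": {...}} and nothing else.',
        "Available tools:",
    ]
    for tool in tools:
        params = json.dumps(tool["parameters"])
        lines.append(f"- {tool['name']}: {tool['description']} Parameters: {params}")
    return "\n".join(lines)


def dispatch(model_output, handlers):
    """Run the requested tool, or return None if the reply is ordinary text."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict):
        return None
    name = call.get("tool")
    if not isinstance(name, str) or name not in handlers:
        return None
    return handlers[name](**call.get("arguments", {}))


def fake_chess_engine(moves):
    # Stand-in for a real engine; it only echoes its input.
    return f"(engine reply for: {moves})"


HANDLERS = {"chess_engine": fake_chess_engine}
print(build_system_prompt(TOOLS))
print(dispatch('{"tool": "chess_engine", "arguments": {"moves": "1. e4 e5"}}', HANDLERS))
```

Nothing here requires the model to do anything beyond continuing the prompt; the "tool use" lives in the instructions and in the surrounding code.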