Certainly an interesting result, but remember that a single paper doesn’t prove anything. This will no doubt be studied extensively, and the findings will change over time as the tools develop.
Personally, I find the current tools don’t work great for large existing codebases and complex tasks. But I’ve found they can help me quickly make small scripts to save me time.
I know, it’s not the most glamorous application, but it’s what I find useful today. And I have confidence the tools will continue to improve. They hardly even existed a few years ago.
The AI tooling churn is so fast that by the time a study comes out, people will be able to say "well, they were using an older tool" no matter which tool the study used.
It's the eternal future.
"AI will soon be able to...".
There's an entire class of investment scammers who string their marks along, claiming the big payoff is just around the corner while they fleece the victim with the death of a thousand cuts.
What is the problem with this, exactly? It's a valid criticism of the study (as applied to current agentic coding practices). The pace of progress is rough for researchers, in some sense, but this is the reality right now.
Not really. Chatting with an LLM was cutting edge for three years; it's only in the last 8-10 months, with Claude Code and the Gemini CLI, that we've had the next big change in how we interact with LLMs.
I can't speak to how they're technically different, but in practice, Cursor was basically useless for me, and Claude Code works well. Even with Cursor using Claude's models.
If there are paradigm-shattering improvements every six months, every study that is ever released will be "behind" or "using an older tool." In six months, when a study comes out using Claude Code, people dissatisfied with its results will be able to point to the newest hotness, ad infinitum.
https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...