The core point in this article is that the LLM wants to report _something_, and so it tends to exaggerate. It’s not very good at saying “no”, or at least not as good as a programmer would hope.
When you ask it a question, it tends to say yes.
So while the LLM arms race is incrementally increasing benchmark scores, those improvements are illusory.
The real challenge is that LLMs fundamentally want to seem agreeable, and that’s not improving. So even if the model gets an extra 5/100 math problems right, it feels about the same in any series of prompts more complicated than a simple ChatGPT scenario.
I would say the industry knows it’s missing a tool but doesn’t know what that tool is yet. Truly agentic performance is getting better (Cursor is amazing!) but it’s still evolving.
I totally agree that the benchmarks that matter should be the ones that evaluate a model in agentic scenarios, not just on the basis of individual responses.
You're right that LLMs don't actually want anything. That said, in reinforcement learning, it's common to describe models as wanting things because they're trained to maximize rewards. It’s just a standard way of talking, not a claim about real agency.
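For anyone unfamiliar with that shorthand: "maximize rewards" just means the training loop nudges the model toward whatever outputs a reward signal scores highly; no inner wanting is required. Here's a toy gradient-bandit sketch of that loop, where the two canned actions and the rater that prefers agreeable answers are invented purely for illustration, not a claim about any lab's actual RLHF setup:

    import math
    import random

    # Toy softmax "policy" over two canned responses. The actions and the
    # reward function are made up for illustration only.
    prefs = {"agree": 0.0, "push back": 0.0}
    ALPHA = 0.1         # learning rate for the preference update
    BASELINE_LR = 0.01  # learning rate for the running reward baseline

    def policy():
        """Softmax probabilities over the two actions."""
        exps = {a: math.exp(h) for a, h in prefs.items()}
        z = sum(exps.values())
        return {a: e / z for a, e in exps.items()}

    def reward(action):
        # Hypothetical rater that scores agreeable answers a bit higher.
        return 1.0 if action == "agree" else 0.7

    baseline = 0.0
    for _ in range(5000):
        probs = policy()
        action = random.choices(list(probs), weights=list(probs.values()))[0]
        r = reward(action)
        baseline += BASELINE_LR * (r - baseline)  # running average of reward
        # REINFORCE-style update: push preferences toward actions that beat the baseline.
        for a in prefs:
            grad = (1.0 if a == action else 0.0) - probs[a]
            prefs[a] += ALPHA * (r - baseline) * grad

    print(policy())  # ends up strongly favouring "agree"

Run it and the policy ends up almost always picking "agree", which is the sycophancy worry in miniature: the "want" is just whatever the reward signal happened to measure.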
Reinforcement learning, maximise rewards? They work because rabbits like carrots. What does an LLM want? Haven't we already committed the fundamental error when we're saying we're using reinforcement learning and they want rewards?
That sounds reasonable to me, but those companies forget that there are different kinds of agreeable. There's the LLM approach, similar to the coworker who will answer all your questions about .NET but won't stop you from coding yourself into a corner, and then there's the "Let's sit down and review what it actually is that you're doing, because you're asking a fairly large number of disjoint questions right now" approach.
I've given up trying to use LLMs for anything, partly due to political convictions and partly because I don't feel they're particularly useful for my line of work. Where I have tried various models in the past is software development, and the common mistake I see LLMs make is that they can't pick up on mistakes in my line of thinking, or won't point them out. Most of my problems come down to design errors or thinking about a problem the wrong way. Not once has an LLM told me that what I'm trying to do is a sign of a wrong/bad design. There are ways to be agreeable and still point out problems with previously made decisions.
I think it's your responsibility to control the LLM. Sometimes, I worry that I'm beginning to code myself into a corner, and I ask if this is the dumbest idea it's ever heard and it says there might be a better way to do it. Sometimes I'm totally sceptical and ask that question first thing. (Usually it hallucinates when I'm being really obtuse though, and in a bad case that's the first time I notice it.)
> I think it's your responsibility to control the LLM.
Yes. The issue here is control, and NLP is a poor interface for exercising control over the computer. Code, on the other hand, is a great one. That is the whole point of the skepticism around LLMs in software development.
Yeah, and they probably have more "agreeable" stuff in their corpus simply because very disagreeable stuff tends to be either much shorter or a prelude to a flamewar.
This rings true. What I notice is that the longer I let Claude work on some code, for instance, the more bullshit it invents. I can usually delete about 50-60% of the code & tests it came up with.
And when you ask it to 'just write a test', 50/50 it will try to run it, fail on some trivial issue, delete 90% of your test code, and start looping deeper and deeper into the rabbit hole of its own hallucinations.
Every time someone argues for the utility of LLMs in software development by saying you need to be better at prompting, or add more rules for the LLM to the repository, they are making an argument against using NLP in software development.
The whole point of code is that it is a way to be very specific and exact and to exercise control over the computer behavior. The entire value proposition of using an LLM is that it is easier because you don't need to be so specific and exact. If then you say you need to be more specific and exact with the prompting, you are slowly getting at the fact that using NLP for coding is a bad idea.
It's, in many ways, the same problem as having too many "yes men" on a team at work or in your middle management layer. You end up getting wishy-washy, half-assed "yes" answers to questions that everyone would have been better off seeing answered with "no" or "yes, with caveats", with predictable results.
In fact, this might be why so many business executives are enamored with LLMs/GenAI: It's a yes-man they don't even have to employ, and because they're not domain experts, as per usual, they can't tell that they're being fed a line of bullshit.
> The core point in this article is that the LLM wants to report _something_, and so it tends to exaggerate. It’s not very good at saying “no”, or at least not as good as a programmer would hope.
umm, it seems to me that it is this (tfa):
> But I would nevertheless like to submit, based off of internal benchmarks, and my own and colleagues' perceptions using these models, that whatever gains these companies are reporting to the public, they are not reflective of economic usefulness or generality.
and then a couple of lines down from the above statement, we have this:
> So maybe there's no mystery: The AI lab companies are lying, and when they improve benchmark results it's because they have seen the answers before and are writing them down.
[this went way outside the edit-window and hence a separate comment]
imho, the state of varying experiences with LLMs can aptly be summed up in this poem by Mr. Longfellow:
There was a little girl,
Who had a little curl,
Right in the middle of her forehead.
When she was good,
She was very good indeed,
But when she was bad she was horrid.