Why do people keep saying that Claude 3 has been nerfed? Their CTO has said on Twitter multiple times that not a single byte has been changed since its launch, so I'm curious why I keep hearing this.
edit: having trouble finding the tweet I saw recently, it might have been from their lead engineer and not the CTO.
I suspect that there is some psychological effect going on where people adjust their expectations and start to be more open to noticing flaws after working with it for a while. Seems to be a recurring thing with most models.
It's likely true that they didn't change the model, same for the many claims of GPT-4 getting worse. But they do keep iterating a lot on the "safety" layers on top: classifiers to detect dangerous requests, the main system prompt...
But I also think it's partially a psychological phenomenon, just people getting used to the magic and finding more bad edge-cases as it is used more.
While I do think that many claims of GPT-4 getting worse were subjective and incorrect, there certainly was an accidental nerfing of at least ChatGPT Plus: OpenAI released an update some months ago that specifically acknowledged the model had become "more lazy", and the update was meant to rectify it.
(I think it was just the settings for how ChatGPT calls the GPT-4 model, and didn't affect use of GPT-4 via the API, though I may be misremembering.)
It is 100% possible for performance regressions to occur by changing the model pipeline and not the model itself. A system prompt is a part of said pipeline.
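To make that concrete, here's a minimal sketch of such a pipeline in Python. Everything in it (the prompt text, the toy refusal check, the call_model placeholder, the sampling settings) is illustrative, not any vendor's actual stack, but it shows how behavior can shift while the weights stay byte-for-byte identical:

    # None of this touches model weights, yet changing any of it changes what users see.
    SYSTEM_PROMPT = "You are a helpful assistant. Keep answers brief."  # editable at any time

    def refusal_classifier(user_msg: str) -> bool:
        # Stand-in for a separate safety model that screens requests
        # before they ever reach the main model.
        return "dangerous" in user_msg.lower()

    def respond(user_msg: str, call_model) -> str:
        if refusal_classifier(user_msg):
            return "Sorry, I can't help with that."
        # System prompt and sampling settings are pipeline config, not weights.
        return call_model(system=SYSTEM_PROMPT, user=user_msg, temperature=0.7)

Tighten the classifier or reword the system prompt and users will report the model as "nerfed", even though the checkpoint itself never changed.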
Absolutely! That was covered in the linked tweet. If you're suggesting they're lying*, I'm happy to extract the system prompt and check.
* I don't think you are! I've looked up to you a lot over the last year on LLMs btw, just vagaries of online communication: I can't tell if you're ignoring the tweet and introducing me to the idea of system prompts, or if you're suspicious it changed recently (in which case, I would want to show off my ability to extract the system prompt to senpai :)
I was agreeing with the tweet and think Anthropic is being honest, my comment was more for posterity since not many people know the difference between models and pipelines.
You're right, that's a good point. It is possible to make a model dumber via quantization.
But even F16 -> llama.cpp Q4 (3.8 bits) has negligible perplexity loss.
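For reference, perplexity is just exp of the average per-token cross-entropy loss, so the comparison is straightforward to run yourself. A rough sketch using Hugging Face transformers; the model IDs are hypothetical placeholders for a full-precision and a quantized build of the same model:

    # Perplexity = exp(mean negative log-likelihood over tokens).
    # Run the same held-out text through both checkpoints and compare.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def perplexity(model_id: str, text: str) -> float:
        tok = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(model_id)
        model.eval()
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            # Passing labels=input_ids makes the model return mean cross-entropy.
            loss = model(ids, labels=ids).loss
        return torch.exp(loss).item()

    sample = "Some held-out evaluation text goes here ..."
    print(perplexity("some-org/model-f16", sample))  # hypothetical full-precision build
    print(perplexity("some-org/model-q4", sample))   # hypothetical 4-bit build

A small gap between the two numbers is what "negligible perplexity loss" means in practice.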
Theoretically, a leading AI lab could quantize absurdly poorly after the initial release, where they know they're going to have huge usage.
Theoretically, they could be lying even though they said nothing changed.
At that point, I don't think there's anything to talk about. I agree both of those things are theoretically possible. But it would be very unusual: two colossal screwups, then active lying, with many observers not leaking a word.
Why would the CTO/lead engineer admit that they nerfed the model even if they did? It's all closed, so how does admitting it benefit them? I would much rather trust the people using it every day.
Beyond that, to people who interact with the models regularly, the "nerf" issue is pretty obvious. It was pretty clear when a new model rollout caused GPT-4 in ChatGPT to try to stick to the "leadup, answer, explanation" response pattern and also start to get lazy about longer responses.
I use Claude 3 Opus daily and I haven't noticed a change in its outputs. I think it's more likely that there's a discontinuity in the inputs the user is providing to Claude, which is tipping it over a threshold into a response type they find incorrect.
When GPT-4 got lobotomized, you had to work hard to avoid the new behavior; it popped up everywhere. People claiming Claude got lobotomized seem to be cherry-picking examples.
Oh my bad, sorry, I misinterpreted your previous comment as meaning "it was obvious with GPT-4, and therefore if people say the same about Claude 3 it must be equally obvious and true", rather than what you meant, which was half the opposite.