I agree with your overall message - rapid growth appears to encourage competition and forces companies to put their best foot forward.
However, unfortunately, I cannot shower much praise on Claude 3.7. And if you (or anyone) asks why - surely 3.7 is much better than 3.5? - then I’m moderately sure that you use Claude much more for coding than for any kind of conversation. In my opinion, even 3.5 Haiku (which is available for free during high loads) is better than 3.7 Sonnet.
Here’s a simple test. Try asking 3.7 to intuitively explain anything technical - say, mass-dominated vs spring-dominated oscillations. I’m a mechanical engineer who studied this stuff, and I could not understand 3.7’s analogies.
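For anyone wondering what that distinction even is, here’s a bare-bones sketch of the standard textbook picture (my summary, not anything 3.7 produced). For a driven mass-spring system,

$$ m\ddot{x} + kx = F_0\cos(\omega t), \qquad X(\omega) = \frac{F_0}{\lvert k - m\omega^2 \rvert}, \qquad \omega_n = \sqrt{k/m}. $$

Well below $\omega_n$ the stiffness term wins and $X \approx F_0/k$: the spring sets the response (spring dominated). Well above $\omega_n$ inertia wins and $X \approx F_0/(m\omega^2)$: the mass sets the response (mass dominated).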
I understand that coders are the largest single group of Claude’s users, but Claude went from being my most-used app to being used only after both ChatGPT and Gemini, something that I absolutely regret.
Yes, it’s particularly bad when the information found on the web is flawed.
For example, I’m not a domain expert, but I was looking for an RC motor for a toy project and OpenAI’s Deep Research happily tried to source a few. Only, the best candidate it picked contained an obvious typo in the motor spec (68 grams instead of 680 grams), which is just impossible for a motor of the specified dimensions.
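To illustrate, here’s a rough plausibility check of the kind I mean; the dimensions below are hypothetical since I don’t have the listing in front of me, but the order of magnitude is the point:

    # Rough sanity check on a listed motor mass; dimensions here are made up
    # for illustration, roughly the size class of a large RC outrunner.
    import math

    diameter_cm = 6.0   # hypothetical can diameter
    length_cm = 7.0     # hypothetical can length
    volume_cm3 = math.pi * (diameter_cm / 2) ** 2 * length_cm

    # Motors are mostly steel, copper and magnets with some air gaps;
    # an effective density of roughly 3-5 g/cm^3 is a reasonable ballpark.
    low, high = 3.0 * volume_cm3, 5.0 * volume_cm3
    print(f"volume ~ {volume_cm3:.0f} cm^3, plausible mass ~ {low:.0f}-{high:.0f} g")
    # -> roughly 590-990 g: 680 g is plausible, 68 g is not.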
Right, but LLMs are also consuming AWS product documentation and the Terraform language docs, two things I have read a lot of, and they’re often badly wrong about both of those domains in ways that are really easy for me to spot.
This isn’t just “shit in, shit out”. Hallucination is real and still problematic.
I had it generate a baseball lineup the other day; it printed out a list of the 13 kids’ names, then said “(12 players)”. It just straight up miscounted its own output, throwing a wrench into everything else it was doing beyond that point.
> My current hypothesis: the more familiar you are with a topic the worse the results from any LLM.
That's not really true, since your prompts are also getting better. "Better input leads to better output" remains true, even with LLMs (when you see them as tools).
Being more familiar with the topic definitely doesn't always make your prompts better. For a lot of things the prompt doesn't really change (explain X, compare X and Y...) - and that is what is being discussed here. For giving "building" instructions (like writing code) it helps a bit, but even if you know exactly what you want it to write, getting it to do that is pretty much trial and error (too much detail makes it follow word-for-word and produce bad code, too little and it misses important parts or makes dumb mistakes).
The opposite may be true: the more capable the model, the lazier the prompting, since it can seemingly handle not being micromanaged the way earlier versions had to be.
The more familiar you are with the state of “Jira hygiene” in the megacorp environment, the less hope you have that LLMs will be able to make sense of things.
That said, the “AI all the things” mandates could be the lever that ultimately accomplishes what 100+ PjMs couldn’t - making people write issues as if they really mattered. Because garbage in, garbage out.
It is like this with expert humans too. Which is why, no matter what, we will continue to require expert humans not just "in the loop" but as the critical cogs that are the loop itself, just as it always has been. However, this time around those people will have AI augmentation, and will be intellectual athletes of a nature our civilization has never seen.
That is certainly the case in niche topics where published information is lacking, or where common sense is needed to synthesize proper outputs [1].
However, in this specific example, I don't remember if it was ChatGPT, Gemini, or 3.5 Haiku, but the other(s) explained it well enough. I think I re-asked 3.5 Haiku at a later point, and to my complete non-surprise, it gave an answer that was quite decent.
1 - For example, the field of DIY audio - which was, funnily enough, the source of my question. I'm no speaker designer, but combining creativity with engineering basics/rules of thumb seems to be something LLMs struggle with terribly. Ask them to design a speaker and they come up with the most vanilla, tired, textbook design - despite several products already on the market that are far more innovative.
I'm confident that if you asked an LLM an equivalent question for which there is more discourse - e.g., design an interesting/innovative phone - you'd get relatively much better results.
3.7 did score higher in coding benchmarks but in practice 3.5 is much better at coding. 3.7 ignores instructions and does things you didn't ask it to do.
Gemini 2.5 Pro has solved problems that Claude 3.7 cannot, so I use it for the hard stuff.
But Gemini is at least as overactive as Claude, sometimes even more so when it comes to something like comment spam.
Of course, this can be fixed with prompting. And sometimes I feel sheepish complaining about the machine god doing most of my chore work, which didn't even exist a couple of years ago.
I think it just does that to eat up your token quota and get you to upgrade.
Like, ask it a simple question and it comes up with a full repo, complete with a README and a Makefile, when all you wanted to know was how efficient a particular algorithm in the code you included would be.
Can't wait until they add research to the Pro plan because, you know, I have questions...
> I think it just does that to eat up your token quota and get you to upgrade.
If you pay for a subscription then they don’t have an incentive to use more tokens for the same answer.
It’s definitely because feedback from people has “taught” it that more boilerplate is better. It’s the same reason ChatGPT is annoyingly complimentary.
Plateauing overall, but apparently you can gain in certain directions while losing in others. I wrote an article a while back arguing that current models are not that far from GPT-3.5: https://omarabid.com/gpt3-now
3.7 is definitely better at coding, but you can feel it lost a bit of maneuverability in other domains. For someone who just wants code generated it doesn't matter, but I've found myself using DeepSeek first and then getting the code output from 3.7.
Seems clear to me that Claude 3.7 suffers from overfitting, probably due to Anthropic seeing that 3.5 was a smash hit in the LLM coding space and deciding their North star for 3.7 should be coding benchmarks (which, like all benchmarks, do not properly capture the process of real-world coding).
If it was actually good they would've named it 4.0; the fact that they went from 3.5 to 3.7 (a weird jump) speaks volumes imo.
The numbering jump is because there was "Claude 3.5" and then "Claude 3.5 (new)", and they decided to retroactively stop the madness and rename the latter to 3.6 (which is what everyone was calling it anyway).
I use Claude mostly for coding/technical things and something about 3.7 does not feel like an upgrade. I haven't gone back to 3.5 (mostly started using Gemini Pro 2.5 instead).
I haven't been able to use Claude research yet (it's not rolled out to the Pro tier), but o1 -> o3 deep research was a massive jump IMHO. It still isn't perfect, but where o1 would often give me trash results, o3 deep research actually starts to be useful.
3.5->3.7 (even with extended thinking) felt like a nothingburger.