I agree with your overall message - rapid growth appears to encourage competition and forces companies to put their best foot forward.
However, unfortunately, I cannot shower much praise on Claude 3.7. And if you (or anyone) asks why - surely 3.7 is much better than 3.5? - then I’m moderately sure that you use Claude much more for coding than for any kind of conversation. In my opinion, even 3.5 Haiku (which is available for free during high loads) is better than 3.7 Sonnet.
Here’s a simple test. Try asking 3.7 to intuitively explain anything technical - say, mass-dominated vs spring-dominated oscillations. I’m a mechanical engineer who studied this stuff, and I could not understand 3.7’s analogies.
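For anyone wondering what that distinction even is, here’s a bare-bones sketch of the standard textbook picture (my summary, not anything 3.7 produced). For a driven mass-spring system,

$$ m\ddot{x} + kx = F_0\cos(\omega t), \qquad X(\omega) = \frac{F_0}{\lvert k - m\omega^2 \rvert}, \qquad \omega_n = \sqrt{k/m}. $$

Well below $\omega_n$ the stiffness term wins and $X \approx F_0/k$: the spring sets the response (spring dominated). Well above $\omega_n$ inertia wins and $X \approx F_0/(m\omega^2)$: the mass sets the response (mass dominated).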
I understand that coders are the largest single group of Claude’s users, but Claude went from being my most-used app to being used only after both ChatGPT and Gemini, something that I absolutely regret.
Yes, it’s particularly bad when the information found on the web is flawed.
For example, I’m not a domain expert, but I was looking for an RC motor for a toy project and OpenAI’s Deep Research happily tried to source a few. Only, the best candidate it picked contained an obvious typo in the motor spec (68 grams instead of 680 grams), which is just impossible for a motor of the specified dimensions.
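To illustrate, here’s a rough plausibility check of the kind I mean; the dimensions below are hypothetical since I don’t have the listing in front of me, but the order of magnitude is the point:

    # Rough sanity check on a listed motor mass; dimensions here are made up
    # for illustration, roughly the size class of a large RC outrunner.
    import math

    diameter_cm = 6.0   # hypothetical can diameter
    length_cm = 7.0     # hypothetical can length
    volume_cm3 = math.pi * (diameter_cm / 2) ** 2 * length_cm

    # Motors are mostly steel, copper and magnets with some air gaps;
    # an effective density of roughly 3-5 g/cm^3 is a reasonable ballpark.
    low, high = 3.0 * volume_cm3, 5.0 * volume_cm3
    print(f"volume ~ {volume_cm3:.0f} cm^3, plausible mass ~ {low:.0f}-{high:.0f} g")
    # -> roughly 590-990 g: 680 g is plausible, 68 g is not.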
Right, but LLMs are also consuming AWS product documentation and the Terraform language docs, two things I have read a lot of, and they’re often badly wrong about both of those domains in ways that are really easy for me to spot.
This isn’t just “shit in, shit out”. Hallucination is real and still problematic.
I had it generate a baseball lineup the other day; it printed out a list of the 13 kids’ names, then said “(12 players)”. It just straight up miscounted its own output, throwing a wrench into everything else it was doing beyond that point.
> My current hypothesis: the more familiar you are with a topic the worse the results from any LLM.
That's not really true, since your prompts are also getting better. "Better input leads to better output" remains true, even with LLMs (when you see them as tools).
Being more familiar with the topic definitely doesn't always make your prompts better. For a lot of things the prompt doesn't really change (explain X, compare X and Y...) - and that is what is being discussed here. For giving "building" instructions (like writing code) it helps a bit, but even if you know exactly what you want it to write, getting it to do that is pretty much trial and error (too much detail makes it follow word-for-word and produce bad code, too little and it misses important parts or makes dumb mistakes).
The opposite may be true: the more capable the model, the lazier the prompting, since it can seemingly handle not being micromanaged the way earlier versions had to be.
The more familiar you are with the state of “Jira hygiene” in the megacorp environment, the less hope you have that LLMs will be able to make sense of things.
That said, the “AI all the things” mandates could be the lever that ultimately accomplishes what 100+ PjMs couldn’t - making people write issues as if they really mattered. Because garbage in, garbage out.
It is like this with expert humans too. Which is why, no matter what, we will continue to require expert humans not just "in the loop" but as the critical cogs that are the loop itself, just as it always has been. However, this time around those people will have AI augmentation, and will be intellectual athletes of a nature our civilization has never seen.
That is certainly the case in niche topics where published information is lacking, or where common sense is needed to synthesize proper outputs [1].
However, in this specific example, I don't remember if it was ChatGPT, Gemini, or 3.5 Haiku, but the other(s) explained it well enough. I think I re-asked 3.5 Haiku at a later point, and to my complete non-surprise, it gave an answer that was quite decent.
1 - For example, the field of DIY audio - which was, funnily enough, the source of my question. I'm no speaker designer, but combining creativity with engineering basics/rules of thumb seems to be something LLMs struggle with terribly. Ask them to design a speaker and they come up with the most vanilla, tired, textbook design - despite several products already on the market that are far more innovative.
I'm confident that if you asked an LLM an equivalent question for which there is more discourse - e.g., design an interesting/innovative phone - you'd get relatively much better results.
3.7 did score higher in coding benchmarks but in practice 3.5 is much better at coding. 3.7 ignores instructions and does things you didn't ask it to do.
Gemini 2.5 Pro has solved problems that Claude 3.7 cannot, so I use it for the hard stuff.
But Gemini is at least as overactive as Claude, sometimes even more so when it comes to something like comment spam.
Of course, this can be fixed with prompting. And sometimes I feel sheepish complaining about the machine god doing most of my chore work, which didn't even exist a couple of years ago.
I think it just does that to eat up your token quota and get you to upgrade.
Like, ask it a simple question and it comes up with a full repo, complete with a README and a Makefile, when all you wanted to know was how efficient a particular algorithm in the code you included would be.
Can't wait until they add research to the Pro plan because, you know, I have questions...
> I think it just does that to eat up your token quota and get you to upgrade.
If you pay for a subscription then they don’t have an incentive to use more tokens for the same answer.
It’s definitely because feedback from people has “taught” it that more boilerplate is better. It’s the same reason ChatGPT is annoyingly complimentary.
Plateauing overall, but apparently you can gain in certain directions while losing in others. I wrote an article a while back arguing that current models are not that far from GPT-3.5: https://omarabid.com/gpt3-now
3.7 is definitely better at coding, but you can feel it lost a bit of maneuverability in other domains. For someone who just wants code generated it doesn't matter, but I've found myself using DeepSeek first and then getting the code output from 3.7.
Seems clear to me that Claude 3.7 suffers from overfitting, probably due to Anthropic seeing that 3.5 was a smash hit in the LLM coding space and deciding their North star for 3.7 should be coding benchmarks (which, like all benchmarks, do not properly capture the process of real-world coding).
If it was actually good they would've named it 4.0; the fact that they went from 3.5 to 3.7 (a weird jump) speaks volumes imo.
The numbering jump is because there was "Claude 3.5" and then "Claude 3.5 (new)", and they decided to retroactively stop the madness and rename the latter to 3.6 (which is what everyone was calling it anyway).
I use Claude mostly for coding/technical things and something about 3.7 does not feel like an upgrade. I haven't gone back to 3.5 (mostly started using Gemini Pro 2.5 instead).
I haven't been able to use Claude research yet (it's not rolled out to the Pro tier), but o1 -> o3 deep research was a massive jump IMHO. It still isn't perfect, but where o1 would often give me trash results, o3 deep research actually starts to be useful.
3.5->3.7 (even with extended thinking) felt like a nothingburger.