
I hope someone reruns this on o1 and eventually o3.

If o1-preview is the start of a new line of models, the way GPT-1 was, then we should expect generalization to improve quickly.



I don't think LLMs generalise much; that's why they're not creative and can't solve novel problems. It's pattern matching with a huge amount of data.

Study on the topic: https://arxiv.org/html/2406.15992v1

This would explain o1's poor performance on problems with variations. o3 seems to be expensive brute-forcing in latent space followed by verification, which should yield better results - but I don't think we can call it generalisation.

I think we need to go back to the drawing board.


From firsthand experience, this simply cannot be true. I can give them totally novel and unique physics problems I just made up - ones that require tracking the movement of objects through a series of events - and they answer most of them correctly. Moreover, they find analogies between disparate concepts and fields of study and make useful suggestions based on them, which is arguably the same process as human creativity.

I think ultimately the disconnect is people theorizing about what it can or cannot do with an incorrect mental model of what it is, and then assuming it cannot do things that it in fact can. The irony of discussions on LLMs is that they mostly showcase the limits of humans' ability to reason about novel situations.


Don't worry, there are thousands of researchers at the drawing boards right now.


Yeah, because if the AI boom becomes the AI bust, we'll have another 2008-level economic crisis on our hands.

The investments into AI are in the hundreds of billions (maybe even more if you factor in the number of people studying and researching AI), but the returns are in the tens of billions (if even that).

If you exclude the "growth" coming from the industry sniffing its own farts (e.g. Nvidia selling insane amounts of insanely overpriced GPUs to InsertYourFavAICorp), the actual amount of "useful goods and services" produced (API access, chat subscriptions, AI-enabled app growth, etc.) is tiny compared to the investment levels.

The AI train appears to have no brakes. A massive crash or AGI are the only options now. Both are going to be bad for average humans.


The fact that this (and tons of other legitimate critique) got downvoted into greytext speaks much louder to me than all the benchmarks in the world.


You're assuming that OpenAI isn't just gonna add the new questions to the training data.


Their methodology shows they can create an infinite variety of problems.

This is the same thing as synthetic training data.

It doesn't matter whether models are trained on the generated data or not. If the model ends up being able to solve newly generated variations, you'd have to admit that it understands the underlying problems.
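For concreteness, here is a minimal sketch of what that kind of templated problem generation could look like - the template, names, and number ranges are invented for illustration, not the paper's actual setup:

    import random

    # Hypothetical problem template; every draw changes the surface form and numbers.
    TEMPLATE = ("{name} has {a} apples and buys {b} more, "
                "then gives away {c}. How many apples remain?")

    def generate_variant(rng: random.Random) -> tuple[str, int]:
        name = rng.choice(["Alice", "Bob", "Priya"])
        a, b = rng.randint(2, 50), rng.randint(2, 50)
        c = rng.randint(1, a + b)          # keep the answer non-negative
        question = TEMPLATE.format(name=name, a=a, b=b, c=c)
        return question, a + b - c          # ground-truth answer for automatic checking

    rng = random.Random(0)
    for _ in range(3):
        q, ans = generate_variant(rng)
        print(q, "->", ans)

Since each draw is fresh, a model can't have memorised the specific instance - which is why solving newly generated variations is the interesting test here.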


I think what it shows is that it has minimal "understanding" of the problem - otherwise such small variations wouldn't pose a challenge. Training it to handle these specific small variations doesn't change that.

It's good at automation, not understanding.


If it were a complete failure on variations I would be inclined to agree. Instead it was a 30% drop in performance. I would characterise that as limited understanding.


My guess is that what’s understood isn’t various parts of solving the problem but various aspects of the expected response.

I see this more akin to a human faking their way through a conversation.


> I see this more akin to a human faking their way through a conversation.

That works in English class. Try it in a math class and you'll get a much lower grade than ChatGPT will.


Fully agree with this


Exactly. The naivety is just sky-high.




