
Yann LeCun's "Hoisted by their own GPTards" is fantastic.


While Yann is clearly brilliant, and has a deeper understanding of the roots of the field than many of us mortals, I think he's been on a Debbie Downer trend lately, and more importantly, some of his public stances have been proven wrong mere months or years after he made them.

I remember a public talk where he was on stage with some young researcher from MS. (I think it was one of the authors of the "Sparks of AGI" GPT-4 paper, but not sure.)

Anyway, throughout that talk he kept talking over the guy and didn't seem to listen, even though he obviously hadn't tried the "raw", "unaligned" model that the folks at MS were talking about.

And he made 2 big claims:

1) LLMs can't do math. He went on to "argue" that LLMs trick you with poetry that sounds good, but is highly subjective, and when tested on hard verifiable problems like math, they fail.

2) LLMs can't plan.

Well, merely one year later, here we are. AIME is saturated (with tool use), gold at the IMO, and current agentic systems clearly can plan (and follow through on the plan, re-write parts, finish tasks, etc. etc.).

So, yeah, I'd take everything any one singular person says with a huge grain of salt. No matter how brilliant said individual is.

Edit: oh, and I forgot another important argument that Yann made at that time:

3) Because of the nature of LLMs, errors compound. So the longer you go in a session, the more errors accumulate, until the output devolves into nonsense.

Again, mere months later the o-series of models came out and basically rendered this point moot. Turns out RL + long context mitigate it fairly well. And a year later, all SotA models are able to "solve" problems 100k+ tokens deep.


> LLMs can't do math. He went on to "argue" that LLMs trick you with poetry that sounds good, but is highly subjective, and when tested on hard verifiable problems like math, they fail.

They really can’t. Token prediction based on context does not reason. You can scramble to submit PRs to ChatGPT to keep up with the “how many Rs in blueberry” kind of problems but it’s clear they can’t even keep up with shitposters on reddit.

And your 2nd and 3rd points about planning and compounding errors remain challenges... probably unsolvable with LLM approaches.


> They really can’t. Token prediction based on context does not reason.

Debating about "reasoning" or not is not fruitful, IMO. It's an endless debate that can go anywhere and nowhere in particular. I try to look at results:

https://arxiv.org/pdf/2508.15260

Abstract:

> Large Language Models (LLMs) have shown great potential in reasoning tasks through test-time scaling methods like self-consistency with majority voting. However, this approach often leads to diminishing returns in accuracy and high computational overhead. To address these challenges, we introduce Deep Think with Confidence (DeepConf), a simple yet powerful method that enhances both reasoning efficiency and performance at test time. DeepConf leverages model-internal confidence signals to dynamically filter out low-quality reasoning traces during or after generation. It requires no additional model training or hyperparameter tuning and can be seamlessly integrated into existing serving frameworks. We evaluate DeepConf across a variety of reasoning tasks and the latest open-source models, including Qwen 3 and GPT-OSS series. Notably, on challenging benchmarks such as AIME 2025, DeepConf@512 achieves up to 99.9% accuracy and reduces generated tokens by up to 84.7% compared to full parallel thinking.
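
For intuition, here's my own rough sketch of that kind of approach, not the paper's actual algorithm: sample a bunch of reasoning traces, treat mean token log-probability as a confidence signal, drop the low-confidence traces, and majority-vote over the rest. sample_trace below is a hypothetical stand-in for whatever model/serving stack you use.

    # Hedged sketch of confidence-filtered majority voting, NOT the paper's
    # implementation. sample_trace() is a hypothetical stand-in that is
    # assumed to return (final_answer, token_logprobs) for one sampled trace.
    from collections import Counter
    from statistics import mean

    def deepconf_style_vote(sample_trace, question, n_traces=64, keep_fraction=0.5):
        traces = [sample_trace(question) for _ in range(n_traces)]
        # Confidence proxy: mean token log-probability of the whole trace.
        ranked = sorted(traces, key=lambda t: mean(t[1]), reverse=True)
        kept = ranked[: max(1, int(len(ranked) * keep_fraction))]
        # Majority vote over the answers of the high-confidence traces.
        votes = Counter(answer for answer, _ in kept)
        return votes.most_common(1)[0][0]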


> Debating about "reasoning" or not is not fruitful, IMO.

That's kind of the whole need, isn't it? Humans can automate simple tasks very effectively and cheaply already. If I ask the pro version of an LLM what the Unicode value of a seahorse is, and it shows a picture of a horse and gives me the Unicode value for a third, completely unrelated animal, then it's pretty clear it can't reason itself out of a wet paper bag.


Sorry, perhaps I worded that poorly. I meant debating about whether context stuffing is or isn't "reasoning". At the end of the day, whatever RL + long context does to LLMs seems to provide good results. Reasoning or not :)


Well, that's my point, and what I think the engineers are screaming at the top of their lungs these days: that it's a net negative. It makes a really good demo but hasn't won anything except maybe translation and simple graphics generation.


> You can scramble to submit PRs to ChatGPT to keep up with the “how many Rs in blueberry” kind of problems but it’s clear they can’t even keep up with shitposters on reddit.

Nobody does that. You can't "submit PRs" to an LLM. Although if you pick up new pretraining data you do get people discussing all newly discovered problems, which is a bit of a neat circularity.

> And your 2nd and 3rd points about planning and compounding errors remain challenges... probably unsolvable with LLM approaches.

Unsolvable in the first place. "Planning" is GOFAI metaphor-based development where they decided humans must do "planning" on no evidence and therefore if they coded something and called it "planning" it would give them intelligence.

Humans don't do or need to do "planning". Much like they don't have or need to have "world models", the other GOFAI obsession.


> LLMs can't do math.

Ignoring conversations about 'reasoning', at a fundamental level LLMs do not 'do math' in the way that a calculator or a human does math. Sure, we can train bigger and bigger models that give you the impression of this, but there are proofs out there that with increased task complexity (in this case multi-digit multiplication) the probability of incorrect predictions eventually converges to 1 (https://arxiv.org/abs/2305.18654).

> And your 2nd and 3rd points about planning and compounding errors remain challenges... probably unsolvable with LLM approaches.

The same issue applies here, really with any complex multi-step problem.

> Again, mere months later the o-series of models came out and basically rendered this point moot. Turns out RL + long context mitigate it fairly well. And a year later, all SotA models are able to "solve" problems 100k+ tokens deep.

If you go hands-on in any decent-size codebase with an agent, session length and context size become noticeable issues. Again, mathematically, error propagation eventually leads to a 100% chance of error. Yann isn't wrong here; we've just kicked the can a little further down the road. What happens at 200k+ tokens? 500k+ tokens? 1M tokens? The underlying issue of a stochastic system isn't addressed.
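
To make the "100% chance of error" intuition concrete, here's a toy calculation assuming, purely for illustration, a fixed and independent per-step success probability p; the chance of an n-step chain staying fully correct is p^n, which heads to zero as n grows:

    # Toy illustration of compounding error. Assumes a fixed, independent
    # per-step success probability, which is a simplification, not a claim
    # about how real LLMs behave.
    for p in (0.99, 0.999):
        for n in (10, 100, 1_000, 10_000):
            print(f"p={p}, steps={n}: P(all steps correct) = {p**n:.6f}")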

> While Yann is clearly brilliant, and has a deeper understanding of the roots of the field than many of us mortals, I think he's been on a Debbie Downer trend lately

As he should be. Nothing he said was wrong at a fundamental level. The transformer architecture we have now cannot scale with task complexity. Which is fine; by nature it was not designed for such tasks. The problem is that people see these models work on a subset of small-scope complex projects and make claims that go against the underlying architecture. If a model is 'solving' complex or planning tasks but then fails at similar tasks of higher complexity, it's a sign that there is no underlying deterministic process. What is more likely: that the model is genuinely 'planning' or 'solving' complex tasks, or that the model has been trained on enough planning and task-related examples that it can make a high-probability guess?

> So, yeah, I'd take everything any one singular person says with a huge grain of salt. No matter how brilliant said individual is.

If anything, a guy like Yann, with a role such as his at a Mag7 company, being realistic (bearish if you are an LLM evangelist) about what the transformer architecture can do is a relief. I'm more inclined to listen to him than to a guy like Altman, who touts LLMs as the future of humanity while his path to profitability is an AI TikTok, sex chatbots, and a third-party way to purchase things from Walmart during a recession.


> AIME is saturated (with tool use) [...]

But isn't tool use kinda the crux here?

Correct me if I'm mistaken, but wasn't the argument back then about whether LLMs could solve maths problems without e.g. writing Python to solve them? Because when "Sparks of AGI" came out in March, prompting gpt-3.5-turbo to code solutions to assist with solving maths problems, rather than just solving them directly, was already established and seemed like the path forward. Heck, it is still the way to go, despite major advancements.

Given that, was he truly mistaken on his assertions regarding LLMs solving maths? Same for "planning".


AIME was saturated with tool use (i.e. 99%) for SotA models, but pure-NL, no-tool runs still perform "unreasonably well" on the task. Not 100%, but still in the 90% range. And with lots of compute they can reach 99% as well, apparently [1] (@512 rollouts, but still).

[1] - https://arxiv.org/pdf/2508.15260


Pretty sure you can fill a room with serious researchers who will, at the very least, doubt that 2) has been solved with LLMs, especially when talking about formal planning with pure LLMs and without a planning framework.

PS: Just so we're clear: formal planning in AI != making a coding plan in Cursor.


> with pure LLMs and without a planning framework.

Sure, but isn't that moving the goalposts? Why shouldn't we use LLMs + tools if it works? If anything it shows that the early detractors weren't even considering this could work. Yann in particular was skeptical that long-context things can happen in LLMs at all. We now have "agents" that can work a problem for hours, with self context trimming, planning to md files, editing those plans and so on. All of this just works, today. We used to dream about it a year ago.


> Why shouldn't we use

So weird that you immediately move the goalposts after accusing somebody of moving the goalposts. Nobody on the planet told you not to use "LLMs + tools if they work." You've moved onto an entirely different discussion with a made-up person.

> All of this just works, today.

Also, it definitely doesn't "just work." It slops around, screws up, reinserts bugs, randomly removes features, ignores instructions, lies, and sometimes you get a lucky result or something close enough that you can fix up. Nothing that should be in production.

Not that they're not very cool and very helpful in a lot of ways. But I've found them more helpful in showing me how they would do something, and getting me so angry that they nerd-snipe me into doing it correctly. I have to admit, however, that 1) sometimes I'm not sure I'd have gotten there if I hadn't seen it not getting there, and 2) sometimes "doing it correctly" involves dumping the context and telling it almost exactly how I want something implemented.


> Sure, but isn't that moving the goalposts?

It can be considered as that, sure, but anytime I see LeCun talking about this, he does recognize that you can patch your way around LLMs; the point is that you are going to hit limits eventually anyway. Specific planning benchmarks like Blocksworld and the like show that LLMs (with frameworks) hit limits when they're exposed to out-of-distribution problems, and that's a BIG problem.

> We now have "agents" that can work a problem for hours, with self context trimming, planning to md files, editing those plans and so on. All of this just works, today. We used to dream about it a year ago.

I use them every day, but I still wouldn't really let them work for hours on greenfield projects. And we're seeing big vibe coders like Karpathy say the same.


> Sure, but isn't that moving the goalposts? Why shouldn't we use LLMs + tools if it works?

Personally i do not see it like that at all, as one is referring to LLMs specifically while the other is referring to LLMs plus a bunch of other stuff around them.

It is like person A claiming that GIF files can be used to play Doom deathmatches; person B responding that, no, a GIF file cannot start a Doom deathmatch, it is fundamentally impossible to do so; and person A retorting that since the GIF format has a provision for advancing a frame on user input, a GIF viewer can interpret that input as the user wanting to launch Doom in deathmatch mode, ergo GIF files can be used to play Doom deathmatches.


At the end of the day LLM + tools is asking the LLM to create a story with very specific points where "tool calls" are parts of the story, and "tool results" are like characters that provide context. The fact that they can output stories like that, with enough accuracy to make it worthwhile is, IMO, proof that they can "do" whatever we say they can do. They can "do" math by creating a story where a character takes NL and invokes a calculator, and another character provides the actual computation. Cool. It's still the LLM driving the interaction. It's still the LLM creating the story.
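
For what it's worth, that "story with tool calls" framing maps pretty directly onto how a basic agent loop is usually wired up. A minimal sketch, where chat and run_tool are hypothetical stand-ins for a real LLM API and a real calculator/tool backend:

    # Minimal agent-loop sketch: the LLM writes the "story", the harness runs
    # the tool calls it asks for and feeds the results back as more context.
    # chat() and run_tool() are hypothetical stand-ins, not any vendor's API.
    def agent_loop(chat, run_tool, user_message, max_steps=10):
        messages = [{"role": "user", "content": user_message}]
        for _ in range(max_steps):
            reply = chat(messages)  # assumed: {"content": str, "tool_call": dict or None}
            messages.append({"role": "assistant", "content": reply["content"]})
            if reply.get("tool_call") is None:
                return reply["content"]  # the story ends with a plain answer
            result = run_tool(reply["tool_call"])  # e.g. evaluate "12345 * 6789"
            messages.append({"role": "tool", "content": str(result)})
        return messages[-1]["content"]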


I think you have that last part backwards: it is not the LLM driving the interaction, it is the program that uses the LLM to generate the instructions that does the actual driving; that is the bit that makes the LLM start doing things. Though that is just splitting hairs.

The original point was about the capabilities of LLMs themselves, since the context was about the technology itself, not what you can do by making them part of a larger system that combines LLMs (perhaps more than one) with other tools.

Depending on the use case and context this distinction may or may not matter, e.g. if you are trying to sell the entire system, it probably is not any more important how the individual parts of the system work than what libraries you used to make the software.

However it can be important in other contexts, like evaluating the abilities of LLMs themselves.

For example i have written a script on my PC that my window manager calls to grab whatever text i have selected in whatever application i'm running and pass it to a program i've written using llama.cpp, which loads Mistral Small with a prompt that makes it check for spelling and grammar mistakes; that in turn produces some script-readable output that another script displays in a window.

This, in a way, is an entire system. This system helps me find grammar and spelling mistakes in the text i have selected when i'm writing documents where i care about finding such mistakes. However, it is not Mistral Small that has the functionality of finding grammar and spelling mistakes in my selected text; it only provides the part that does the text checking, and the rest is done by other, external, non-LLM pieces. An LLM cannot intercept keystrokes on my computer, it cannot grab my selected text, nor can it create a window on my desktop; it doesn't even understand these concepts. In a way this can be thought of as a limitation from the perspective of the end result i want, but i work around it with the other software i have attached to it.
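
For anyone curious, here is a rough reconstruction of that kind of pipeline. This is a guess at the shape of it, not the poster's actual scripts; the xclip usage, model path, and llama-cli flags are assumptions.

    # Hedged sketch of a "grab selection -> LLM grammar check -> display" pipeline.
    # Assumes X11 with xclip installed and llama.cpp's llama-cli binary plus a
    # local Mistral Small GGUF; all paths and flags here are assumptions.
    import subprocess

    def check_selection(model_path="mistral-small.gguf"):
        # Grab whatever text is currently selected (X11 primary selection).
        selected = subprocess.run(
            ["xclip", "-o", "-selection", "primary"],
            capture_output=True, text=True, check=True,
        ).stdout
        prompt = (
            "List any spelling or grammar mistakes in the following text, "
            "one per line, as 'original -> correction':\n\n" + selected
        )
        # Run the model once over the prompt and capture its output.
        result = subprocess.run(
            ["llama-cli", "-m", model_path, "-p", prompt, "-n", "512"],
            capture_output=True, text=True, check=True,
        ).stdout
        print(result)  # the poster displays this in a window; printing is the stand-in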


I might be missing context here, but I'm surprised to see Yann using language that plays on 'retard.'

That seems out of character for him - more like something I'd expect from Elon Musk. What's the context I'm missing?


I don't think it's wordplay on the r-word, but rather a reference to the famous Shakespeare quote: "Hoist with his own petard". It's become an English proverb. (A petard is a smallish bomb.)


From péter, to fart.

Possibly entered the language as a saying due to Shakespeare being scurrilous.


It's a play on the word petard


I found this background useful as a non-native speaker: https://en.wikipedia.org/wiki/Hoist_with_his_own_petard


Hoist (thrown in the air) by your own petard (bomb) is a common phrase.


You have been Hoisted with your own retard



