Not sure why everyone rates this. It’s full of very confidently made statements like “the AI has no ground truth” (obviously it does, it has ingested every paper ever), it “can’t reason logically” which seems like a stretch if you ever read the CoT of a frontier reasoning model and “can’t explain how they arrived at conclusions” where - I mean just try it yourself with o1, go as deep as you like asking how it arrived at a conclusion and see if a human can do any better.
In fact the most annoying thing about this article is that it is a string of very confidently made, black and white statements, offered with no supporting evidence, and some of which I think are actually wrong… i.e. it suffers from the same kind of unsubstantiated self confidence that we complain about with the weaker models
LLMs that use Chain of Thought sequences have been demonstrated to misrepresent their own reasoning [1]. The CoT sequence is another dimension for hallucination.
So, I would say that an LLM capable of explaining its reasoning doesn't guarantee that the reasoning is grounded in logic or some absolute ground truth.
I do think it's interesting that LLMs demonstrate the same fallibility of low quality human experts (i.e. confident bullshitting), which is the whole point of the OP course.
I love the goal of the course: get the audience thinking more critically, both about the output of LLMs and the content of the course. It's a humanities course, not a technical one.
(Good) Humanities courses invite the students to question/argue the value and validity of course content itself. The point isn't to impart some absolute truth on the student - it's to set the student up to practice defining truth and communicating/arguing their definition to other people.
First, thank you for the link about CoT misrepresentation. I've written a fair bit about this on Bluesky etc but I don't think much if any of that made it into the course yet. We should add this to lesson 6, "They're Not Doing That!"
Your point about humanities courses is just right and encapsulates what we are trying to do. If someone takes the course and engages in the dialectical process and decides we are much too skeptical, great! If they decide we aren't skeptical enough, also great. As we say in the instructor guide:
"We view this as a course in the humanities, because it is a course about what it means to be human in a world where LLMs are becoming ubiquitous, and it is a course about how to live and thrive in such a world. This is not a how-to course for using generative AI. It's a when-to course, and perhaps more importantly a why-not-to course.
"We think that the way to teach these lessons is through a dialectical approach.
"Students have a first-hand appreciation for the power of AI chatbots; they use them daily.
"Students also carry a lot of anxiety. Many students feel conflicted about using AI in their schoolwork. Their teachers have probably scolded them about doing so, or prohibited it entirely. Some students have an intuition that these machines don't have the integrity of human writers.
"Our aim is to provide a framework in which students can explore the benefits and the harms of ChatGPT and other LLM assistants. We want to help them grapple with the contradictions inherent in this new technology, and allow them to forge their own understanding of what it means to be a student, a thinker, and a scholar in a generative AI world."
I'll give it a read. I must admit, the more I learn about the inner workings of LLM's the more I see them as simply the sum of their parts and nothing more. The rest is just anthropomorphism and marketing.
Whenever I see someone confidently making a comparison between LLMs and people, I assume they are unserious individuals more interested in maintaining hype around technology than they are in actually discussing what it does.
Someone saying "they feel" something is not a confident remark.
Also, there's plenty of neuroscience that is produced by very serious researchers that have no problems making comparisons between human brain function and statistical models.
Current LLMs are not the end-all of LLMs, and chain of thought frontier models are not the end-all of AI.
I’d be wary of confidently claiming what AI can and can’t do, at the risk of looking foolish in a decade, or a year, or at the pace things are moving, even a month.
That's entirely true. We've tried hard to stick with general principles that we don't think will readily be overturned. But doubtless we've been too assertive for some people's taste and doubtless we'll be wrong in places. Hence the choice to develop not a static book but rather living document that will evolve with time. The field is developing too fast for anything else.
I think that’s entirely the problem. You’re making linear predictions of the capabilities of non-linear processes. Eventually the predictions and the reality will diverge.
Every time someone claimed “emerging” behavior in LLMs it was exactly that. I can probably count more than 100 of these cases, many unpublished, but surely it is easy to find evidence by now.
Not quite, but it was the closest pithy quote I could think of to convey the point that things can be false for a long time before they are suddenly true without warning.
How about "Yes, they laughed at Galileo, but they also laughed at Bozo the Clown?"
We heard alllllll the same hype about how revolutionary the blockchain was going to be and look how that turned out.
It's a virtue to point out the emperor has no clothes. It's not a virtue to insist clothes tech is close to being revolutionary and if you just understand it harder, you'd see the space where the clothes go.
The post seems to be talking about the current capabilities of large language models. We can certainly talk about what they can or cannot do as of today, as that is pretty much evidence based.
The ground truth is chopped off into tokens and statistically evaluated. It is of course just a soup of ground truth that can freely be used in more or less twisted ways that have nothing to do or are tangent to the ground truth. While I enjoy playing with LLMs I don't believe they have any intrinsic intelligence to them and they're quite far from being intelligent in the same sense that autonomous agents such as us humans are.
Any all of the tricks getting tacked on are overfitting to the test sets. It's all the tactics we have right now and they do provide assistance in a wide variety of economically valuable tasks with the only signs of stopping or slowing down is data curation efforts
I've read that paper. The strong claim, confidently made in the OP is (verbatim) "they don’t engage in logical reasoning.".
Does this paper show that LLMs "don't engage in logical reasoning"?
To me the paper seems to mostly show that LLMs with CoT prompts (multiple generations out of date) are vulnerable to sycophancy and suggestion -- if you tell the LLM "I think the answer is X" it will try too hard to rationalize for X even if X is false -- but that's a much weaker claim than "they don't engage in logical reasoning". Humans (sycophants) do that sort of thing also, it doesn't mean they "don't engage in logical reasoning".
Try running some of the examples from the paper on a more up-to-date model (e.g. o1 with reasoning turned on) it will happily overcome the biasing features.
I think you'll find that humans have also demonstrated that they will misrepresent their own reasoning.
That does not mean that they cannot reason.
In fact, to come up with a reasonable explanation of behaviour, accurate or not, requires reasoning as I understand it to be. LLMs seem to be quite good at rationalising which is essentially a logic puzzle trying to manufacture the missing piece between facts that have been established and the conclusion that they want.
It's 1994. Larry Llyod Mayer has read the entire internet, hundreds of thousands of studies across every field, and can answer queries word for word the same as modern LLMs do. He speaks every major language. He's not perfect, he does occasionally make mistakes, but the sheer breadth of his knowledge makes him among the most employable individuals in America. The Pentagon, IBM, and Deloitte are begging to hire him. Instead, he works for you, for free.
Most laud him for his generosity, but his skeptics describe him as just a machine that spits out words. A stochastic parrot, useless for any real work.
I do anticipate it, but in the situations I'm asked to do such calculations, I don't usually have the option of refusing, nor would I want to. For most real would situations, it's generally better to arrive at a ballpark solution than to refuse to engage with the problem.
In the very unserious hypothetical I'm describing, I'd say Lloyd's capabilities match that of GPT-4. In this case, he's not a calculator, but he is a decent programmer, so like GPT-4 he quickly runs the operation through a script, rather than trying to figure it out in his head.
I would be very careful to claim exactly that as emergent properties seem kinda crucial for artificial and human intelligences. (Not to say that they are equally functioning nor useful.)
What experiment or measurement could I do to distinguish between a machine that “knows” the truth and a machine that merely “spits it out”? I’m trying to understand your terminology here
> I mean just try it yourself with o1, go as deep as you like asking how it arrived at a conclusion
I don't mean to disagree overall, but on this point the LLM can post-facto rationalize its output but it has no introspection and has absolutely no idea why it made a given bit of output (except in so far as it was a result of COT which it could reiterate to you). The set of weights being activated could be nearly disjoint when answering and explaining the answer.
One can also make the same argument about humans -- that they can't introspect their own minds and are just posthoc rationalizing their explanations unless their thinking was a product of an internal monolog that they can recount. But humans have a lifetime of self-interaction that gives a good reason to hope that their explanations actually relate to their reasoning. LLM's do not.
And LLMs frequently give inconsistent results, it's easy to demonstrate the posthoc nature of LLM's rationalizations too: Edit the transcript to make the LLM say something it didn't say and wouldn't have said (very low probability), and then have it explain why it said that.
(Though again, split brain studies show humans unknowingly rationalizing actions in a similar way)
I doubt people are very accurate at knowing why they made the choices they did. If you want them to recite a chain of reasoning they can but that is kind of far from most decision making most people do.
I agree people aren't great at this either and my post said as much.
However we're familiar with the human limits of this and LLMs are currently much worse.
This is particularly relevant because someone suffering from the mistaken belief that LLM's could explain their reasoning might go on to attempt to use that to justify the misapplication of an LLM.
E.g. fine tune some LLM using resume examples so that it almost always rejects Green-skinned people, but approve the LLMs use in hiring decisions because it is insistent that it would never base a decision on someone's skin color. Humans can lie about their biases of course, but a human at least has some experience with themselves while a LLM usually has no experience observing themself except for the output visible in their current window.
I also should have added that the ability to self explain when COT was in use only goes as deep as the COT, as soon as you probe deeper such that the content of the COT requires explanation the LLM is back in the realm of purely making stuff up again.
A non-hallucinated answer could only recount the COT and beyond that it would only be able to answer "Instinct."-- sure the LLM's response has reasoning hidden inside it, but that reasoning is completely inaccessible to the LLM.
I've had frontier reasoning models (or at least what I can access in ChatGPT+ at any given moment) give wildly inconsistent answers when asked to provide the underlying reasoning (and the CoT weren't always given). Inventing sources and then later denying them mentioned them. Backtracking on statements it claimed to be true. Hiding weasel words in the middle of a long complicated argument to arrive at whatever it decided the answer was. So I'm inclined to believe the reasoning steps here are also susceptible to all the issues discussed in the posted article.
Imagine I would tell my wife, that whenever we have a discussion, her opinion would only be valid when she can explain how she arrived at her conclusion.
Your wife is one of the end products of cutthroat competition across several billion years so let's just say her general intelligence has a fair bit more validation than 20 years of research.
Well, for what it's worth, I believe that this evolutionary pressure works as strongly, or even more so, against women who challenge men about the validity of their reasoning.
But we know how the LLM works, and that's exactely how the authors explain it. And that explain also the weird mistakes they do, that nothing with the ability of reason or having a ground truth would do.
I really do not understand how technical people can think they are sentient
If it's mimicry of reason is indistinguishable from real reasoning, how is it not reasoning?
Ultimately, an LLM models language and the process behind it's creation to some degree of accuracy or another. If that model includes a way to approximate the act of reasoning, then it is reasoning to some extent. The extent I am happy to agree is open for discussion, but that reasoning is taking place at all is a little harder to attack.
No, it is distinguishable from real reasoning. Real reasoning, while flawed in various ways, goes through personal experience of the evaluator. LLMs don't have that capability at all. They're just sifting though tokens and associate statistical parameters to it with no skin in the game so to speak.
LLM's have personal option by virtue of the fact they make statements of things they understand to the extent their training data allows. Their training data is not perfect, and in addition, through random chance the LLM will latch onto specific topics as a function of weight initialization and training data order.
This would form a filter not unlike, yet distinct from, our understanding of personal experience.
you could make the exact same argument against humans, we just learn to make sounds that elicit favourable responses. Besides, they have plenty "skin in the game", about the same as you or I.
It seems like an arbitrary distinction. If an LLM can accomplish a task that we’d all agree requires reasoning for a human to do, we can’t call that reasoning just because the mechanics are a bit different?
Yes because it isn't an arbitrary distinction. My good old TI-83 can do calculations that I can't even do in my head but unlike me it isn't reasoning about them, that's actually why it's able to do them so fast, and it has some pretty big implications about what it can't do.
If you want to understand where a systems limitations are you need to understand not just what it does but how it does it, I feel like we need to start teaching classes on Behaviorism again.
An LLM’s mechanics are algorithmically much closer to the human brain (which the LLM is modeled on) than a TI-83, a CPU, or any other Turing machine. Which is why, like the brain, it can solve problems that no individual Turing machine can.
Are you sure you aren’t just defining reasoning as something only a human can do?
My prior is reasoning is a conscious activity. There is a first person perspective. LLMs are so far removed mechanically from brains the idea they reason is not even remotely worth considering. Modeling neurons can be done with a series of pipes and flowing water, and that is not expected to give arise to consciousness either. Nor are nuerons and synapses likely to be sufficient for consciousness.
You know how we insert ourselves into the process of coming up with a delicious recipe? That first person perspective might be also necessary for reasoning. No computer knows the taste of mint, it must be given parameters about it. So if a computer comes up with a recipe with mint, we know it wasn’t via tasting anything ever.
A calculator doesn’t reason. A facsimile of something we have no idea about its role in consciousness has the same outlook as the calculator.
You’re right that my argument depends upon there being a great physical distinction between brains and H100s or enough water flowing through troughs.
But since we knew properties of wings were major comments to flight dating back to beyond the myths of Pegasus or Icarus, we rightly connected the similarities in the flight case.
Yet while we have studied neurons and know the brain is apart of consciousness, we don’t know their role in consciousness like the wing’s for flight.
If you got a bunch if daisy chained brains and that started doing what LLMs do, I’d change my tune—because the physical substrates are now similar enough. Focusing on neurons, and their facsimilized abstractions, may be like thinking flight depending upon the local cellular structure of a wing, rather than the overall capability to generate lift, or any other false correlation.
Just because an LLM and a brain get to the same answer, doesn’t mean they got there the same way.
Because we know practically nothing about brains so comparing them to LLMs is useless and nature is so complex that we're constantly discovering signs of hubris in human research.
See C-sections versus natural birth. Formula versus mother's milk. Etc.
I think you'd benefit from reading Helen Keller's autobigoraphy "the world i live in", you might reach the same conclusions I did, this being that perhaps conciousness is flavoured by our unique way of experiencing our world, but not strictly neccesary for conciousness of some kind or another to form. I beleive conciousness to be a tool a sufficently complex neural network will develop in order for it to achieve whatever objective it has been given to optimize for.
Taking a different tack from others in this thread. I don't think you can say that a TI-83 is not reasoning if it is doing calculations. Certainly it is not aware of any concepts of numbers and has no meaningful sense of the operation, but those are attributes of sentience, not reasoning. The reasoning ability of a calculator is extremely limited but what make those capabilities that it does have, non reasoning.
What non-sentience based property do you think something should have to be considered reasoning. Do you consider sentience and reasoning to be one and the same? If not then you should be able to indicate what distinguishes one from the other.
I doubt anyone here is arguing that chatGPT is sentient, yet plenty accept that it can reason to some extent.
>Do you consider sentience and reasoning to be one and the same?
No, but I think they share some similarities. You can be sentient without doing any reasoning, just through experience, there's probably a lot of simple life forms in that category. Where they overlap I think, is in that they require a degree of reflection. Reasoning I'd say is the capacity to distinguish between truth and falsehoods, to have mental content of the object you're reasoning about and as a consequence have a notion of understanding and an interior or subjective view.
The distinction I'd make is that calculation or memorization is not reasoning at all. My TI-83 or Stockfish can calculate math or chess but they have no notion of math or chess, they're basically Chinese rooms, they just perform mechanical operations. They can appear as if they reason, even a chess engine purely looking up results in a table base and with very simplistic brute force can play very strong chess but it doesn't know anything about chess. And with the LLMs you need to be careful because the "large" part does a lot of work. They often can sound like they reason but when they have to explain their reasoning they'll start to make up obvious falsehoods or contradictions. A good benchmark if something can reason is probably if it can.. reason about its reasoning coherently.
I do think the very new chain-of-thought models are more of a step into that direction, the further you get away from relying on data the more likely you're building something that reasons but we're probably very early into systems like that.
You say they are distinguishable. How would you experimentally distinguish two systems, one of which "goes through personal experience" and therefore is doing "real reasoning", vs one which is "sifting through tokens and associating statistical parameters"? Can you define a way to discriminate between these two situations?
I am getting two contradictory but plausible seeming replies when I ask about a certain set being the same when adding 1 to every value in the set, asked on how I ask the question.
What led you to beleive that mathematics is a good tool for evaluating an LLM? It is a thing they currently dont do well, since it is wildly out of domain of their training corpus - down the very way we structure information for an LLM to ingest. If we start doing the same for humans, most humans are in deep trouble.
Well I am studying mathematics, and I use the LLM to help me learn.
They aren't terrible, and they have all of arXiv to train on. Terrence Tao is doing some cool stuff with it - the idea will be an LLM to generate Lean proofs.
And I can assure you when I start to talk about these topics with the average human person that doesn't know the material, they just laugh at me. Even my wife who has a PhD in physics.
Here's some cool math I learned from a regular book, not an LLM:
I don't give a rat's ass about whether or not AI reasoning is "real" or a "mimicry". I care if machines are going to displace my economic value as a human-based general intelligence.
If a synthetic "mimicry" can displace human thinking, we've got serious problems, regardless of whether or not you believe that it's "real".
fair, but "logically consistent thoughts" is a subject of deep investigation starting from the early euclidean geometry to the modern godel's theorems.
ie, that logically consistent thinking starts from symbolization, axioms, proof procedures, world models. otherwise, you end up with persuasive words.
You just ruled out 99% of humans from having reasoning capabilities.
The beautiful thing about reasoning models is that there is no need to overcomplicate it with all the things you've mentioned, you can literally read the model's reasoning and decide for yourself if it's bullshit or not.
That's sort of arrogant, Most of that 99 (if that many) % could learn if inspired to and provided resources. And does use reasoning and instinct in day-to-day life even if it's as simple as "I'll take go shopping before I take my car to the shop so I have the groceries" or "hide this money in a new place so my husband doesn't drink it away". Models will get better over time, and yes humans only use models too.
Humans rely in cues to tell when each other is fabricating or lying. Machines don't have those cues, and fabricate their reasoning too. So we have a complicatedly difficult time trusting them.
>You just ruled out 99% of humans from having reasoning capabilities.
After a conversation with humans I think you'd agree 1% of them being able to reason deeply is a vast overestimation.
A good example to see how little people can reason is the following classic:
> Given the following premises derive a conclusion about your poems:
> 1) No interesting poems are unpopular among people of real taste.
> 2) No modern poetry is free from affectation.
> 3) All your poems are on the subject of soap bubbles.
> 4) No affected poetry is popular among people of taste.
> 5) Only a modern poem would be on the subject of soap bubbles.
The average person on the street won't even know where to start, the average philosophy student will fuck up the translation to first order logic, and a logic professor would need a proof assistant to get it right consistently.
Meanwhile o3-mini in 10 seconds:
We can derive a conclusion about your poems by following the logical implications of the given premises. Let’s rephrase each premise into a more formal form:
Premise 1: No interesting poems are unpopular among people of real taste.
This can be reworded as:
If a poem is interesting, then it is popular among people of real taste.
Premise 2: No modern poetry is free from affectation.
This tells us:
If a poem is modern, then it is affected (i.e., it shows affectation).
Premise 3: All your poems are on the subject of soap bubbles.
In other words:
Every one of your poems is about soap bubbles.
Premise 4: No affected poetry is popular among people of taste.
This implies:
If a poem is affected, then it is not popular among people of taste.
Premise 5: Only a modern poem would be on the subject of soap bubbles.
This means:
If a poem is about soap bubbles, then it is modern.
Now, let’s connect the dots step by step:
From Premise 3 and Premise 5:
All your poems are on the subject of soap bubbles.
Only modern poems can be about soap bubbles.
Conclusion: All your poems are modern.
From the conclusion above and Premise 2:
Since your poems are modern, and all modern poems are affected,
Conclusion: All your poems are affected.
From the conclusion above and Premise 4:
Since your poems are affected, and no affected poem is popular among people of taste,
Conclusion: Your poems are not popular among people of taste.
From Premise 1:
If a poem is interesting, it must be popular among people of taste.
Since your poems are not popular among people of taste (from step 3), it follows that:
Conclusion: Your poems cannot be interesting.
Final Conclusion:
Your poems are not interesting.
Thus, by logically combining the premises, we conclude that your poems are not interesting.
I could trace through that example quite quickly and I'm not an expert in logic, so I think you might be exaggerating some statements about difficulty here.
Except, human mimicry of "reasoning" is usually applied in service of justifying an emotional feeling, arguably even less reliable than the non-feeling machine.
this is the question that the greeks wrestled with over 2000 years ago. at the time there were the sophists (modern llm equivalents) that could speak persuasively like a politician.
over time this question has been debated by philosophers, scientists, and anyone who wanted to have better cognition in general.
Because we know what LLM's do. We know how they produce output. It's just good enough at mimicking human text/speech that people are mystified and stupified by it. But I disagree that "reasoning" is so poorly defined that we're unable to say an LLM doesn't do it. It doesn't need to be a perfect or complete definition. Where there is fuzziness and uncertainty is with humans. We still don't really know how the human brain works, how human consciousness and cognition works. But we can pretty confidently say that an LLM does not reason or think.
Now if it quacks like a duck in 95% of cases, who cares if it's not really a duck? But Google still claims that water isn't frozen at 32 degrees Fahrenheit, so I don't think we're there yet.
I think the third worst part of the GenAI hype era is that every other CS grad now thinks not only is a humanities/liberal arts degree meaningless but now also they're pretty sure they have a handle on the human condition and neurology enough to make judgment calls on what's sentient. If people with those backgrounds ever attempted to broach software development topics they'd be met with disgust by the same people.
Somehow it always seems to end up at eugenics and white supremacy for those people.
math arose firstly as a language and formalism in which statements could be made with no room for doubt. the sciences took it further and said that not only should the statements be free of doubt, but also that they should be testable in the real world via well defined actions which anyone could carry out. all of this has given us the gadgets we use today.
llm, meanwhile, is putting out plausible tokens which is consistent with its training set.
The writer is speaking from the perspective of the traditional philosophical understanding of a thinking being.
No, LLMs are not thinking beings with internal state. Even these "reasoning" models are just prompting the same LLM over and over again which is not true "logic" the way you and I think when we are presented with a new problem.
The key difference is they do not have actual logic, they rely on statistical calculations and heuristics to come up with the next set of words. This works surprisingly well if the thing has seen all text written, but there will always be new scenarios, new ideas it has not encountered and no these are not better than a human at those tasks and likely never will be.
However, what is happening is that our understanding of intelligence is being expanded, and our belief that we are going to be the only intelligent beings ever is under threat and that makes us fundamentally anxious.
> “the AI has no ground truth” (obviously it does, it has ingested every paper ever
it does not, AI is predicting the next ‘token’ based on the last ‘token’. There is no sentience, it’s machine learning except the machines are really strong.
It’d be illogical to say an AI has a ground truth just because it ‘ingested’ every paper ever.
What does sentience have to do with truth? I didn’t make that connection, you did. Wikipedia isn’t sentient but it contains a lot of truth. Raw data isn’t sentient but it definitely “has ground truth”.
When you have a machine that can only infer rules for reasoning from inputs [which are, more often than not, encoded in a very roundabout way within a language which is very ambiguous, like English], you have necessarily created something without "ground."
That's obviously useful in certain situations (especially if you don't know the rules in some domain!), but it's categorically not capable of the same correctness guarantees as a machine that actually embodies a certain set of rules and is necessarily constrained by them.
I'm contending that, like any good tool, there is a context where it is useful, and a context where it is not (and that we are at a stage where everything looks suspiciously like a nail).
Hey, I'm definitely on your side of the Great AI Wars--and definitely share your thoughts on the overall framing--but I think you're missing the serious nature of this contribution:
1. Small correction, it's actually a whole book AFAIK, and potentially someday soon, a class! So there's a lot more thought put in then the typical hot-take blog post. I also pop into one of these guy's replies on Bluesky to disagree on stuff fairly regularly, and can vouch for his good faith, humble effort to get it right (not something to be taken for granted!)
2. RE:“the AI has no ground truth”, I'd say this is true, no matter how often they're empirically correct. Epistemological discussions (aka "how do humans think") invariably end up at an idea called Foundationalism, which is exactly what it sounds like: that all of our beliefs can be traced back to one or more "foundational" beliefs that we either do not question at all (axioms) or very rarely do (premises on steroids?). In that sense, this phrase is simply recalling the hallucination debates we're all familiar with in slightly more specific, long-standing terms; LLMs do not have a systematic/efficient way of segmenting off such fundamental beliefs and dealing with them deliberately. Which brings me to...
3. RE:“can’t reason logically”, again this is a common debate that I think is being specified more than usual here. A lot of philosophy draws a distinction between automatic and deliberate cognition. I give credit to Kant for the best version, but it's really a common insight, found in ideas like "Fast vs. Slow thinking"[1], "first order vs. recursive" thought[2], "ego vs. superego"[3], and--most relevantly--intuition vs. reason.[4] At the very least, it's not a criticism to be dismissed out of hand based on empirical success rates!
4. Finally, RE:“can’t explain how they arrived at conclusions”, that's really just another discussion of point 2 in more explicitly epistemic terms. You can certainly ask o3 to reason (hehe) about the cognitive processing likely to be behind a given transcript, but it's not actually accessing any internal state, which is a very important distinction! o3 would do just as well explaining the reasoning behind a Claude output as it would with one of its own.
Sorry for the rant! I just leave a lot of comments that sound exactly like yours on "LLMs are useless" blog posts, and I wanted to do my best to share my begrudging appreciation for this work.
The title is absurdly provocative, but they're not dismissing LLMs, they're characterizing their weaknesses using a colloquial term -- namely "bullshit" as used for "lying without knowing that you're lying".
I've literally build a dynamic bench mark where I test reasoning models on their performance on deriving conclusions from assumptions through sequent calculus.
o3-mini high effort can derive chains that are 8 inference rules deep with >95% confidence I didn't have the money to test it further. This is better than the average professor in logic when given pen and paper.
It seems like a course critiquing 5 year old technology at this point.
In fact the most annoying thing about this article is that it is a string of very confidently made, black and white statements, offered with no supporting evidence, and some of which I think are actually wrong… i.e. it suffers from the same kind of unsubstantiated self confidence that we complain about with the weaker models