It is built on the observation of how fast AI is getting better. If the speed of improvement stays anywhere near the level it was the last two years, then over the next two decades, it will lead to massive changes in how we work and which skills are valuable.
Just two years ago, I was mesmerized by GPT-3's ability to understand concepts:
https://twitter.com/marekgibney/status/1403414210642649092
Nowadays, using it daily in a productive fashion feels completely normal.
Yesterday, I was annoyed with how cumbersome it is to play long mp3s on my iPad. I asked GPT-4 something like "Write an html page which lets me select an mp3, play it via play/pause buttons and offers me a field to enter a time to jump to". And the result was usable out of the box and is my default mp3 player now.
Two years ago it didn't even dawn on me that this would be my way of writing software in the near future. I have been coding for over 20 years. But for little tools like this, it is faster to ask ChatGPT now.
It's hard to imagine where we will be in 20 years.
The article doesn't say that LLMs aren't useful - the "hype" it refers to is overestimating their capabilities. An LLM may be able to pass a "theory of mind" test, or it may fail spectacularly, depending on how you prompt it. And that's because, despite all of its training data, it's not capable of actually reasoning. That may change in the future, but we're not there yet, and (AFAIK) nobody can tell how long it will take to get there.
> And that's because, despite all of its training data, it's not capable of actually reasoning. That may change in the future [...]
I don't think so. When you say "it's not capable of actually reasoning", that's because it's an LLM; and if that "changes in the future", it will be because the new system is no longer a pure LLM. The appearance of reasoning in LLMs is an illusion.
Because it literally can't reason, and it also has no innate agency. Even the most dedicated creators of LLM-based AI technology have clearly and repeatedly stated that these are very sophisticated stochastic parrots with no sense of self. How much easier could it be to see that LLMs like GPT aren't actual thinking machines in the way we humans are?
Yes, many people reason based on pure pattern-matching and repeat opinions not because they've reasoned them out but because they've absorbed them from other sources. But even the world's most unreasoning human being with at least functional cognition still uses an enormous amount of constant, daily, hourly self-directed decision-making, for a vast variety of complex and simple, often completely spontaneous scenarios and tasks, in ways that no machine we've yet built on Earth does or could.
Moreover, even when some humans say or "believe" things based on nothing more than what they've absorbed from others without really considering it in depth, they almost always do so in a particularly selective way that fits their cognitive, emotional and personal predispositions. This very selectiveness is a distinctly conscious trait of a self-aware being. It's something LLMs don't have, as far as I've yet seen.
In the same way that illusions of anything else differ from the real thing. A wax apple is different from a real apple, even if it's hard to tell them apart sometimes. You may require further investigation to differentiate them (e.g., cutting open the apple or asking the AI to solve tricky reasoning questions), but if you can find a difference, there is a difference.
I have a hunch I am misunderstanding your argument, but does that mean the only way to build a "true reasoning machine" would be to just create a human?
I guess what I'm really asking is: what would you expect to observe that would make it not illusory?
To distinguish between "is an illusion" and "is not an illusion", you need evidence that isn't observational. The whole point of illusions is that observational evidence is unreliable.
A desert mirage in the distance is an illusion; to the observer, it's indistinguishable from an oasis. You can only tell that it's a mirage by investigating how the appearance was created (e.g. by dragging your thirsty ass through the sand, to the place where the oasis appeared to be).
If one has a reasonable understanding of two concepts that make up a larger system, and that system has little else in addition to those concepts, then one should be able to come up with that system by oneself - even having never seen it, and without its composition ever being explained prior to that logical process.
The illusion happens when the alleged reasoning behind how such a system comes to be is clearly based on prior knowledge of the system as a whole - meaning its construction/source was in the training data.
That sounds like a good litmus test. Do you have a specific example you've tried?
My opinion is that it isn't binary; rather, it's a scale. Your example is a point on the scale higher than where it is now.
But perhaps that's too liberal a definition of "reasoning", no idea.
We seem to move the goalposts on what constitutes human level intelligence as we discover the various capabilities exhibited in the animal kingdom. I wonder if it is/will be the same with AI
I'm really curious, are you able to demonstrate reasoning, not reasoning and the illusion of reasoning in a toy example? I'd like to see what each looks like.
Have you met someone who is full of bullshit? They sound REALLY convincing - except, if you know anything about the subject, their statements are just word salad.
Have you met someone who's good at bullshitting their way out of a tough spot? There may be a word salad involved, but preparing it takes some serious skill and brainpower, and perhaps a decent high-level understanding of a domain. At some point, the word salad stops being a chain of words, and becomes a product of strong reasoning - reasoning on the go, aimed at navigating a sticky situation, but reasoning nonetheless.
The finest bullshitter I knew had serious skill and brainpower; and he BS'd about stuff he was expert in. It was really a sort of party trick - he could leave his peer experts speechless (more likely, rolling on the floor laughing).
His output was indeed word-salad, but he was eloquent. His bullshit wasn't fallacious reasoning; it didn't even have the appearance of reasoning at all. He was just stringing together words and concepts that sound plausible. It was funny, because his audience knew (and were supposed to know) that it was nonsense.
LLMs are the same, except they're supposed to pretend that it isn't nonsense.
Bullshit has an illusion of reasoning instead of actual reasoning. Basically, you give arguments that sound reasonable on the surface but there is no actual reasoning behind them.
> Bullshit has an illusion of reasoning instead of actual reasoning.
Bullshit is a good case to consider, actually. What is the relationship between bullshit and reasoning? You could argue that bullshit is fallacious reasoning, "pseudo-reasoning" based on incorrect rules of inference.
But these models don't use any rules of inference; they produce output that resembles the result of reasoning, but without reasoning. They are trained on text samples that presumably usually are the result of human reasoning. If you trained them on bullshit, they'd produce output that resembled fallacious reasoning.
No, I don't think the touchstone for actual reasoning is a human mind. There are machines that do authentic reasoning (e.g. expert systems), but LLMs are not such machines.
> Bullshit is a good case to consider, actually. What is the relationship between bullshit and reasoning?
None in principle, at least if you take the common definition of bullshit as saying things for effect, without caring whether they're true or false.
Fallacious reasoning will make you wrong. No reasoning will make you spew nonsense. Truth and lies and bullshit, all require reasoning for the structure of what you're saying to make sense, otherwise it devolves to nonsense.
> But these models don't use any rules of inference
Neither do we. Rules of inference came from observation. Formal reasoning is a tool we can employ to do better, but it's not what we naturally do.
> None in principle, at least if you take the common definition of bullshit as saying things for effect, without caring whether they're true or false.
Maybe splitting hairs, but I’d argue that the bullshitter is reasoning about what sounds good, and what sounds good needs at least some shared assumptions and resulting logical conclusion to hang its hat on. Maybe not always, but enough of the time that I would still consider reasoning to be a key component of effective bullshit.
That's not the case. It's very much in the realm of "we don't know what's going on in the network."
Rather than a binary, it's much more likely a mix of factors going into the results: basic reasoning capabilities developed from the training data (much like the board representations and state-tracking abilities that developed from feeding board game moves into a toy model in Othello-GPT), as well as statistics-driven autocomplete.
In fact, when I've seen GPT-4 get hung up on logic puzzle variations such as transparency, it often seems more like the latter overriding the former. Changing tokens to emoji representations, or having it always repeat the adjectives attached to nouns so it preserves the variation's context, gets it over the hump to reproducible solutions (as would be expected from a network capable of reasoning), but by default it falls into the pattern of the normative cases.
For something as complex as SotA neural networks, binary sweeping statements seem rather unlikely to actually be representative...
As a PhD student in NLP who's graduating soon, my perspective is that language models do not demonstrate "reasoning" in the way most people colloquially use the term.
These models have no capacity to plan ahead, which is a requirement for many "reasoning" problems. If it's not in the context, the model is unlikely to use it for predicting the next token. That's why techniques like chain-of-thought are popular; they cause the model to parrot a list of facts before making a decision. This increases the likelihood that the context might contain parts of the answer.
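To make that concrete, here is a minimal sketch of the difference between a direct prompt and a chain-of-thought prompt. The question wording is purely illustrative (it isn't from the thread or any particular benchmark):

    # The same question, posed two ways (wording is illustrative only):
    question = (
        "A bat and a ball cost $1.10 in total. "
        "The bat costs $1.00 more than the ball. How much does the ball cost?"
    )

    # Direct prompt: the model must emit the answer tokens immediately.
    direct_prompt = question + "\nAnswer:"

    # Chain-of-thought prompt: the model first writes out intermediate facts,
    # which then sit in its own context and can be attended to when it finally answers.
    cot_prompt = question + "\nLet's think step by step."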
Unfortunately, this means the "reasoning" exhibited by language models is limited: if the training data does not contain a set of generalizable text applicable to a particular domain, a language model is unlikely to make a correct inference when confronted with a novel version of a similar situation.
That said, I do think adding reasoning capabilities is an active area of research, but we don't have a clear time horizon on when that might happen. Current prompting approaches are stopgaps until research identifies a promising approach for developing reasoning, e.g. combining latent space representations with planning algorithms over knowledge bases, constraining the logits based on an external knowledge verifier, etc (these are just random ideas, not saying they are what people are working on, rather are examples of possible approaches to the problem).
In my opinion, language models have been good enough since the GPT-2 era, but have been held back by a lack of reasoning and efficient memory. Making the language models larger and trained on more data helps make them more useful by incorporating more facts with increased computational capacity, but the approach is fundamentally a dead end for higher level reasoning capability.
I'm curious where you are drawing your definition or scope for 'reasoning' from?
For example, in Shuren's The Neurology of Reasoning (2002), the definition selected was "the ability to draw conclusions from given information."
While I agree that LLMs can only process token to token and that juggling context is critical to effective operation such that CoT or ToT approaches are necessary to maximize the ability to synthesize conclusions, I'm not quite sure what the definition of reasoning you have in mind is such that these capabilities fall outside of it.
The typical lay audience suggestion that LLMs cannot generate new information or perspectives outside of the training data isn't the case, as I'm sure you're aware, and synthesizing new or original conclusions from input is very much within their capabilities.
Yes, this has to happen within a context window and occurs on a token by token basis, but that seems like a somewhat arbitrary distinction. Humans are unquestionably better at memory access and running multiple subprocesses on information than an LLM.
But if anything, this simply suggests that continuing to move in the direction of multiple pass processing of NLP tasks with selective contexts and a variety of fine tuned specializations of intermediate processing is where practical short term gains might lie.
As for the issue of new domains outside of training data, I'm somewhat surprised by your perspective. Hasn't one of the big research trends over the past twelve months been that in-context learning has proven more capable than was previously expected? I'd agree that a zero-shot evaluation of a problem type that isn't represented in an LLM's training data is setting it up for failure, but the capacity to extend in-context examples outside of training data has proven relatively more successful, no?
> These models have no capacity to plan ahead, which is a requirement for many "reasoning" problems. If it's not in the context, the model is unlikely to use it for predicting the next token. That's why techniques like chain-of-thought are popular; they cause the model to parrot a list of facts before making a decision. This increases the likelihood that the context might contain parts of the answer.
Is it not possible that this is essentially how our brains do it too? Attempt to plan by branching out to related ideas until they contain an answer. Any of these statements that AI can't be on track to reason like a human because of X seem to come with an implication that we have such a good model of the human brain that we know it doesn't X. But I'm not an expert on neuroscience so in many of these cases maybe that implication is true.
I think the word "essentially" is important here. I don't think we can observe how we think. How it appears in consciousness is not necessarily real - it might be just a model constructed ex-post.
I do not know that much about AI but I know at least something about cognitive psychology and it seems to me that a lot of claims about LLMs "not actually reasoning" and similar are probably made by CS graduates who have unreflected assumptions about how human thinking works.
I don't claim to know how human thinking works but if there is one thing I would conclude from studying psychology and knowing at least some basics about neuroscience, it would be that "it's not how it appears to us".
Nobody knows how human reasoning actually works but if I had to guess (based on my amateurish mental model of the functioning of the human brain), I would say that it is probably a lot closer to LLMs and a lot less rational than is commonly assumed in discussions like this one.
Maybe don't assume that PhD-level NLP researchers are out of touch on cognitive neuroscience topics related to language understanding. The latest research seems to indicate that language production and understanding exist separately from other forms of cognitive capacity. This includes people with global aphasia (no language ability) being able to do math, understand social situations, appreciate music, etc.
If you want to follow this more closely, I'd recommend the work of Evelina Fedorenko, a cognitive neuroscientist at MIT who specializes in language understanding.
What this means in the context of LLMs is that next word prediction alone does not provide the breadth of cognitive capacity humans exhibit. Again, I'd posit GPT-2 is plenty capable as an LM, if combined with an approach to perform higher-level reasoning to guide language generation. Unfortunately, what that system is and how to design it currently eludes us.
First, you are right I should not assume anyone's knowledge (or lack thereof). It just popped into my mind as something that could explain the thing that's been puzzling me for months - what are people talking about when they say that LLMs are not actually reasoning, or Stable Diffusion is not actually creating? I wish I had not included that assumption and was inquisitive instead. Let me try again.
Maybe I diverted your focus the wrong way when I used LLMs as an example - what if I used the more general term "neural network"? I said LLMs because this thread is about LLMs, but let me clarify what I meant:
The thing that interests me in this thread is the claim that LLMs are "not capable of actually reasoning". Whether you agree with it depends on your mental model of actual reasoning, right?
My model of reasoning: the fundamental thing about it is that I have a network of things. The signal travels through the network guided by the weight of connections between them and fires some pattern of the things. That pattern represents something. Maybe it is a word in the case of LLMs (or syllable or whatever the token actually is - let's ignore those details for now) or a thought in the case of my brain (I was not saying people reason in language) - the resulting "token" can be many things, I imagine (like some mental representation of objects and their positions in spatial reasoning) - those are the specifics, but "essentially", the underlying mechanism is the same.
In my mental model, there is nothing fundamental that distinguishes what LLMs do from the "actual reasoning". If you have enough compute and good enough training data, you can create LLM reasoning as well as humans - that is my default hypothesis.
If I understand your position, you would not agree with that, correct? I am not claiming you are wrong - I know way too little for that. I would just be really curious - what is your mental model of actual reasoning? What does it have that LLMs do not have?
I know you mentioned that "these models have no capacity to plan ahead" - I am not sure I understand what you mean by that. Is this not just a matter of training?
BTW, I have talked about this topic before and some people apparently see consciousness as a necessary part of actual reasoning. I do not - do you?
> if the training data does not contain a set of generalizable text applicable to a particular domain, a language model is unlikely to make a correct inference when confronted with a novel version of a similar situation.
True. But look at the Phi-1.5 model - it punches 5x above its weight. The trick is in the dataset:
> Our training data for phi-1.5 is a combination of phi-1’s training data (7B tokens) and newly created synthetic, “textbook-like” data (roughly 20B tokens) for the purpose of teaching common sense reasoning and general knowledge of the world (science, daily activities, theory of mind, etc.). We carefully selected 20K topics to seed the generation of this new synthetic data. In our generation prompts, we use samples from web datasets for diversity. We point out that the only non-synthetic part in our training data for phi-1.5 consists of the 6B tokens of filtered code dataset used in phi-1’s training (see [GZA+ 23]).
> We remark that the experience gained in the process of creating the training data for both phi-1 and phi-1.5 leads us to the conclusion that the creation of a robust and comprehensive dataset demands more than raw computational power: It requires intricate iterations, strategic topic selection, and a deep understanding of knowledge gaps to ensure quality and diversity of the data. We speculate that the creation of synthetic datasets will become, in the near future, an important technical skill and a central topic of research in AI.
Synthetic data has its advantages - less bias, more diversity, scalability, higher average quality. But more importantly, it can cover all the permutations and combinations of skills, concepts, and situations. That's why a small model of just 1.5B parameters like Phi was able to work like a 7B model. Usually models at that scale are not even coherent.
Define reasoning. Because by my definition GPT 4 can reason without doubt. It definitely can't reason better than experts in the field, but it can reason better than, say, interns.
I don't have access to GPT 4 but I'd be interested to see how it does on a question like this:
"Say I have a container with 50 red balls and 50 blue balls, and every time I draw a blue ball from the container, I add two white balls back. After drawing 100 balls, how many of each different color ball are left in the container? Explain why."
... because on GPT 3.5 the answer begins like the below and then gets worse:
"Let's break down the process step by step:
Initially, you have 50 red balls and 50 blue balls in the container.
1) When you draw a blue ball from the container, you remove one blue ball, and you add two white balls back. So, after drawing a blue ball, you have 49 blue balls (due to removal) and you add 2 white balls, making it a total of 52 white balls (due to addition) ..."
If I was hiring interns this dumb, I'd be in trouble.
EDIT: judging by the GPT-4 responses, I remain of the opinion I'd be in trouble if my interns were this dumb.
This is such a flawed puzzle. And GPT 4 answers it rightly. It is a long answer but the last sentence is "This is one possible scenario. However, there could be other scenarios based on the order in which balls are drawn. But in any case, the same logic can be applied to find the number of each color of ball left in the container."
The ability to identify that there isn't a simple closed form result is actually a key component of reasoning. Can you stick the answer it gives on a gist or something? The GPT 3.5 response is pure, self-contradictory word salad and of course delivered in a highly confident tone.
> The ability to identify that there isn't a simple closed form result is actually a key component of reasoning.
If that's the case, then most humans alive would fail to meet this threshold. Finding a general solution to a specific problem, identifying whether or not there exists a closed-form solution, and even knowing these terms, are skills you're taught in higher education, and even the people who went through it are prone to forget all this unless they're applying those skills regularly in their life, which is a function of specific occupations.
GPT 4 goes into detail about one example scenario, which most humans won't do, but it is a technically correct answer, as it said it depends on the order.
Its answer isn't correct, this isn't a possible ending scenario:
- *Ending Scenario:*
- Red Balls (RB): 0 (all have been drawn)
- Blue Balls (BB): 50 (none have been drawn)
- White Balls (WB): 0 (since no blue balls were drawn, no white balls were added)
- Total Balls: 50
> but it is a technically correct answer, as it said it depends on the order.
It should give you pause that you had to pick not only the line by which to judge the answer but the part of the line. The sentence immediately before that is objectively wrong:
I asked GPT4 and it gave a similar response. So then I asked my wife and she said, "do you want more white balls at the end or not?" And I realized that, as a CS or math question, we assume that the draw is random. Other people assume that you're picking which ball to draw.
So I clarified to ChatGPT that the drawing is random. And it replied: "The exact numbers can vary based on the randomness and can be precisely modeled with a simulation or detailed probabilistic analysis."
I asked for a detailed probabilistic analysis and it gives a very simplified analysis. And then basically says that a Monte Carlo approach would be easier. That actually sounds more like most people I know than most people I know. :-)
I don't understand the question. Surely the answer depends on which order you withdraw balls in? Is the idea that you blindly withdraw a ball at every step, and you are asking for the expected value of each number of balls at the end of the process?
Seems like quite a difficult question to compute exactly.
I reworded the question to make it clearer and then it was able to simulate a bunch of scenarios as a monte carlo simulation. Was your hope to calculate it exactly with dynamic programming? GPT-4 was not able to do this, but I suspect neither could a lot of your interns.
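For what it's worth, the expected leftover counts can be computed exactly with a memoized recursion over the container state, using linearity of expectation. A minimal sketch (my own, assuming each draw is uniformly random; the function name `expected` is just for illustration):

    from functools import lru_cache

    # Expected (red, blue, white) counts left after `draws` more uniformly random
    # draws, starting from a container with r red, b blue and w white balls.
    @lru_cache(maxsize=None)
    def expected(r, b, w, draws):
        if draws == 0:
            return (float(r), float(b), float(w))
        total = r + b + w
        exp = [0.0, 0.0, 0.0]
        # Each branch: probability of drawing that colour times the expected end state.
        if r:
            sub = expected(r - 1, b, w, draws - 1)
            exp = [e + (r / total) * s for e, s in zip(exp, sub)]
        if b:
            sub = expected(r, b - 1, w + 2, draws - 1)  # drawing blue adds two whites
            exp = [e + (b / total) * s for e, s in zip(exp, sub)]
        if w:
            sub = expected(r, b, w - 1, draws - 1)
            exp = [e + (w / total) * s for e, s in zip(exp, sub)]
        return tuple(exp)

    print(expected(50, 50, 0, 100))  # expected red, blue, white left after 100 draws

This only gives the expected values, of course; the actual outcome is a distribution over (red, blue, white) counts, as others point out below.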
>I don't understand the question. Surely the answer depends on which order you withdraw balls in? Is the idea that you blindly withdraw a ball at every step, and you are asking for the expected value of each number of balls at the end of the process?
These are very good questions that anyone with the ability to reason would ask if given this problem.
You're asking GPT to do maths in its head, the AI equivalent of a person standing in the middle of the room with no tools and getting grilled in an oral examination of their knowledge.
Instead, collaborate with it, while giving it the appropriate tools to help you.
I asked it to write a Monte Carlo simulation of the problem in Wolfram Mathematica script. It did this about 10-100x faster than I would have been able to. It made a few small mistakes with the final visualisation, but I managed to get it to output a volumetric plot showing the 3D scatter plot of the histogram of possible outcomes.
It can reason better than most humans put into the same situation.
This problem doesn't result in a constant value, it results in a 3D probability distribution! Very, very few humans could work that out without tools. (I'm including pencil and paper in "tools" here.)
With only a tiny bit of coaxing, GPT 4 produced an animated video of the solution!
Try to guess what fraction of the general population could do that at all. Also try to estimate what fraction of general software developers could solve it in under an hour.
A human could get a valid end state most of the time; gpt-4 seems to mess up more than it gets it right, based on the examples posted here. So to me it seems like gpt-4 is worse than humans.
Gpt-4 with help from a competent human will of course do better than most humans, but that isn't what we are discussing.
I disagree. Don't assume "most humans" are anything like Silicon Valley startup developers. Most developers out there in the wild would definitely struggle to solve problems like this.
For example, a common criticism of AI-generated code is the risk of introducing vulnerabilities.
I just sat in a meeting for an hour, literally begging several developers to stop writing code vulnerable to SQL injection! They just couldn't understand what I was even talking about. They kept trying to use various ineffective hacky workarounds ("silver bullets") because they just didn't grok the problem.
>It can reason better than most humans put into the same situation.
On what basis do you allege this? People say the most unhinged stuff here about AI, and it so often goes completely unchallenged. This is a huge assertion that you are making.
The equivalent of what current-gen LLMs do is an oral examination. Picture standing in the middle of a room surrounded by subject matter experts grilling you for your knowledge of various random topics. You have no tools, no calculator, no pencil and paper.
You’re asked a question and you just have to spit out the answer. No option to backtrack, experiment, or self correct.
“Translate this to Hebrew”.
“Is this a valid criticism of this passage from a Platonic perspective?”
“Explain counterfactual determinism in Quantum Mechanics.”
“What is the cube root of 74732?”
You would fail all of these. The AI gets 3 of 4 correct.
Tell me who’s smarter?
You, because of your preconceptions, or because of real superiority?
These were the first responses each time, with no massaging/retries/leading answers. I will say it's not entirely there. I re-ran the initial question a few times afterwards and one was basically gibberish.
There's a couple of scenarios, which depend on when you run out of blue balls:
1) If you draw all 50 blue balls first, for every blue ball you draw, you are adding 2 white balls. Hence, once you've drawn all the blue balls, you have replaced them with 100 white balls. This means you've drawn 50 balls total and there's now 100 white balls and 50 red balls (150 balls) in the container. Then, you continue and draw the next 50 balls. These could be any combination of white and red balls, we don't know which color is preferred when you continue drawing.
2a) If you draw all red balls in these next 50 draws, you would have 50 red balls out and 50 blue balls out. So, you would end with 100 white balls in the container.
2b) If you draw all white balls in these next 50 draws you would have 50 red balls, 50 blue balls, and 50 white balls out. So, you would end with 50 white balls and 50 red balls in the container.
2c) If you draw a mixture of red and white balls in these next 50 draws, the number leftover in the container would vary according to the specific combination drawn.
Remember, the order in which the balls are drawn matters to this question since we have a rule in place that changes the overall total (or population) of balls within the container. In each of these scenarios, you've drawn 50 times initially (all blue) and then 50 times again (in some unknown combination of red and white). You've drawn 100 times total and changed the number of white balls in the container from zero to an amount dependent on how many times you drew a white ball on your second round of 50 draws.
2b) If you draw all white balls in these next 50 draws you would have 50 red balls, 50 blue balls, and 50 white balls out. So, you would end with 50 white balls and 50 red balls in the container.
... so after removing 100 balls, I've removed 150 balls? And the 150 balls that I've removed are red, white and blue despite the fact that I removed 50 blue balls initially and then 50 white ones.
Just because it fails one test in a particular way doesn’t mean it lacks reasoning entirely. It clearly does have reasoning based on all the benchmarks it passes.
You are really trying to make it not have reasoning for your own benefit
> You are really trying to make it not have reasoning for your own benefit
This whole thread really seems like it's the other way around. It's still very easy to make ChatGPT spit out obviously wrong answers depending on the prompt. If it had an actual ability to reason, as opposed to just generating a continuation to your prompt, the quality of the prompt wouldn't matter as much.
GPT 4 still does a lot of dumb stuff on this question; you see several people post an outright wrong answer and say "Look how gpt-4 solved it!". That happens quite a lot in these discussions, so it seems like the magic to get gpt-4 to work is that you just don't check its answers properly.
I've had to work with imperfect machines a lot in my recent past. Just because sometimes it breaks, doesn't mean it's useless. But you do have to keep your eyes on the ball!
I think that's the crux of the whole argument. It's an imperfect (but useful) tool, which sometimes produces answers that make it seem like it can reason, but it clearly can't reason on its own in any meaningful way
There's a reason you see people walking around in hard hats and steel toed boots in some companies. It's not because everything works perfectly all the time!
Seems like it reasons its way to this answer at the end to me:
Mind you, while averages are insightful, they don't capture the delightful unpredictability of each individual run. Would you like to explore this delightful chaos further, or shall we move on to other intellectual pursuits?
Took a bit of massaging and I enabled the Data Analysis plugin which lets it write python code and run it. It looks like the simulation code is correct though.
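For reference, a minimal sketch of what such a Monte Carlo simulation might look like in plain Python (my own sketch of the puzzle as stated, assuming uniformly random draws; it is not the code the model actually produced, and the run count is arbitrary):

    import random
    from collections import Counter

    def simulate():
        # Start with 50 red and 50 blue; each blue drawn adds two white balls back.
        container = ["red"] * 50 + ["blue"] * 50
        for _ in range(100):
            ball = container.pop(random.randrange(len(container)))
            if ball == "blue":
                container += ["white", "white"]
        return Counter(container)

    runs = 10_000
    totals = Counter()
    for _ in range(runs):
        totals.update(simulate())
    # Average leftover counts per colour; individual runs vary around these.
    print({colour: round(totals[colour] / runs, 1) for colour in ("red", "blue", "white")})

Averaged over many runs this gives the expected leftover counts; each individual run differs, which is exactly the distribution the commenters above describe.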
ChatGPT is trained on text that includes most reasoning problems that people come up with.
You see reasoning issues when you use more real world examples, rather than theoretical tests.
I had 4 failure states.
1) Summarization: It summarized 3 transcripts correctly, for the fourth it described the speaker as a successful VC. The speaker was a professor.
2) It was to act as a classifier, with a short list of labels. Depending on the length of text, the classifier would swap over to text gen. Other issues included novel labels, new variations of labels, and so on.
3) Agents - This died on the vine. Leave aside having to learn async, vector DBs, or whatever. You can never trust the output of an LLM, so you can never chain agents.
4) I focused on using ChatGPT to complete a project. I hadn't touched HTML ever - the goal was to use ChatGPT to build the site. This would cover design, content, structure, development, hosting, and improvements.
I still have trauma. Wrong code and bad design were the basic issues. If the code was correct, it simply meant I had dug a deeper grave. I had anticipated 70% of the work being handled by ChatGPT; it ended up at 30% at the most.
ChatGPT is great IF you already are a subject expert - you can brush over the issues and move on.
"Hallucinations" is the little bit of string that you pull on, and the rest unravels. There are no hallucinations, only humans can hallucinate - because we have an actual ground truth to work with.
LLMs are only creating the next token. For them to reason, they must be holding structures and proxies in some data store, and actively altering it.
It's easier to see once you deal with hallucinations.
If it can solve basic logic problems, then it can reason. And if it can write the code of a new game with new logic, then it can reason for sure.
Example of basic problem: In a shop, there are 4 dolls of different heights P,Q,R and S. S is neither as tall as P nor as short as R. Q is shorter than S but taller than R. If Kittu wants to purchase the tallest doll, which one should she purchase? Think step by step.
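For illustration, that puzzle can be checked by brute force; a minimal sketch (my own, not model output; the helper `valid` is just for this example):

    from itertools import permutations

    # Heights: index 0 = shortest, index 3 = tallest.
    def valid(order):
        h = {doll: i for i, doll in enumerate(order)}
        return (
            h["S"] < h["P"]               # S is not as tall as P
            and h["S"] > h["R"]           # S is not as short as R
            and h["R"] < h["Q"] < h["S"]  # Q is shorter than S but taller than R
        )

    for order in permutations("PQRS"):
        if valid(order):
            print("Ordering (short to tall):", order, "-> tallest:", order[-1])

The only ordering consistent with the constraints is R < Q < S < P, so the tallest doll is P.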
> And that's because, despite all of its training data, it's not capable of actually reasoning.
Your conclusion doesn't follow from your premise.
None of these models are trained to do their best on any kind of test. They're just trained to predict the next word. The fact that they do well at all on tests they haven't seen is miraculous, and demonstrates something very akin to reasoning. Imagine how they might do if you actually trained them or something like them to do well on tests, using something like RL.
> None of these models are trained to do their best on any kind of test
How do you know GPT-4 wasn't trained to do well on these tests? They didn't disclose what they did for it, so you can't say it wasn't trained to do well on these tests. That could be the magic sauce for it.
They are trained to predict next tokens in a stream.
That is the learning algorithm.
The algorithm they learn in response is quite different, since that learned algorithm is based on the training data.
In this case the models learn to sensibly continue text or conversations. And they are doing it so well it’s clear they have learned to “reason” at an astonishing level.
Sometimes, not as good as a human.
But in a tremendous number of ways they are better.
Try writing an essay about the many-worlds interpretation of the quantum field equation, from the perspective of Schrödinger, with references to his personal experiences, using analogies with medical situations, formatted as a brief for the Supreme Court, in Dr. Seuss prose, in a random human language of choice.
In real time.
While these models have some trouble with long chains of reasoning, and with reasoning about things they don’t have experience with (different modalities, although sometimes they are surprisingly good), it is clear that they can also reason by combining complex information drawn from their whole knowledge base much faster and more sensibly than any human has ever come close to.
Where they exceed us, they trounce us.
And where they don’t, it’s amazing how fast they are improving. Especially given that year to year, biological human capabilities are at a relative standstill.
——
EDIT: I just tried the above test. The result was wonderfully whimsical prose and references that made sense at a very basic level, and that a Supreme Court of 8-year-olds would likely enjoy, especially if served along with some Dr. Seuss art! In about 10-15 seconds.
Viewed as a solution to an extremely complex constraint problem, that is simply amazing. And far beyond human capabilities on this dimension.
You are right that the process involves predicting words from training data. But you can still make training data focused on passing these tests. Adding millions of test questions to all of these to optimize for answering test questions is perfectly doable when you have the resources OpenAI has.
A strong hint to what they focused on in their training process is what metrics they used in their marketing of the model. You should always bet on models being optimized to perform on whatever metrics they themselves give you when they market the model. Look at the gpt-4 announcement, what metrics did they market? So what metrics should we expect they optimized the model for?
Exam results are the first metric they mention, so exams were probably one of their top priorities when they trained gpt-4.
Yes, absolutely. They can adjust performance priorities.
By the relative mix of training data, additional fine tuning training phases, and/or pre-prompts that give the model extra guidance relative to particular task types.
LLMs are trained to predict text, and one of the results of this is the LLM has as many "faces" as exist in the training data, so it's going to be _very_ different depending on the prompt. It's not a consistent entity like a human. RLHF is an attempt to mediate this, but it doesn't work perfectly.
I’m often confused by claims about reasoning capabilities. A lack of reasoning is often mentioned in debates as a clear and undeniable issue with current LLMs. So since this claim can be made, where are said tests of reasoning skills that GPT-4 fails?
If it’s a debate on the illusion of reasoning, I’d be careful how I step here, because it’s been found these things probably work so well because the human brain is also a biological real-time prediction machine and “just” guessing too: https://www.scientificamerican.com/article/the-brain-guesses...
Or the corollary: that there's really no such thing as anthropomorphic. There's inputs and outputs, and an observer's opinion on how well the outputs relate to the inputs. Thing producing the outputs, and the observer, can be human or not human. Same difference.
It absolutely is anthropomorphizing to claim "GPT-3's ability to understand concepts" rather than simply calling it "reproduce, mix and match text from an enormous corpus". And we can totally legitimately compare to a jury of human observers' opinions on how well(/badly) the output generated relates to the inputs.
For the specific example the OP cited ("War: like being eaten by a dragon and then having it spit you out"): unless its answer to "Where were you in between being eaten by a dragon and before it spat you out?" is "in the dragon's digestive system", that isn't understanding.
And I'm curious to see it answer "Dragons only exist in mythology; does your analogy mean war doesn't exist either? Why not compare to an animal that exists?"
Anyone who has used GPT-4 for any amount of time knows it is definitely not "reproduce, mix and match text from an enormous corpus". Also what's the point of the question. GPT has been specifically trained to not answer this. But in any case here is the answer with instruct model which generally is more open ended:
Sure, I can compare it to a real animal such as bears. Just like dragons are depicted as fearsome and destructive creatures in mythology, bears are also portrayed as powerful and dangerous animals in many cultures. However, just as the existence of dragons is a matter of fiction, the concept of a "war on bears" would also be nonsensical.
In reality, the issue of war is complex and multifaceted, involving political, economic, and societal factors. It cannot be reduced to a simplistic battle between humans and a particular animal species.
Furthermore, the use of animals in analogies can be problematic as it may trivialize the gravity and complexity of real-world issues. Wars and conflicts have serious consequences for human lives, whereas animals do not possess the same agency or ability to wage war.
In conclusion, while dragons may serve as a metaphor for war in mythology, comparing war to a fictional creature diminishes the seriousness of the topic and fails to capture the complexities of real conflicts.
>Also what's the point of the question. GPT has been specifically trained to not answer this. But in any case here is the answer with instruct model which generally is more open ended:
It would demonstrate basic reasoning skills that weren't things one would "reproduce, mix and match text from an enormous corpus". Like the response you provided, which is meaningless word salad. It's a prima facie takedown of your post.
This is like people who hate poetry, insisting their bad poetry is good poetry. Why? Because who else is to say otherwise! Well, the good poets. The people that appreciate poetry will know the difference. Everyone else wont care, save for those invested in having to sell their bad poetry as good.
What has poetry to do with reasoning? You should think of GPT as a terse person who refuses this kind of thing. Certainly there are people like that who have good reasoning skills but can't answer your question in a poetic way (I being one).
Can AI people stop with the defense of "what if thing really is not a thing?" "what if thing is really what humans do?" These aren't answers to questions. It's deflecting nonsense posed as philosophical thought.
We are in a Cambrian Explosion on the software side and hardware hasn’t yet reacted to it. There’s a few years of mad discovery in front of us.
People have different impressions as to the shape of the curve that’s going up and to the right, but only a fool would not stop and carefully take in what is happening.
Exactly and things are actually getting crazy now.
Pardon the tangent but for some reason this hasn't reached the frontpage on HN yet: https://github.com/OpenBMB/ChatDev
Making your own "internal family system" of AIs is making this exponential (and frightening), like an ensemble on top of the ensemble, with specific "mindsets" that, with shared memory, can build and do stuff continuously. Found this from a comp sci professor on TikTok, so be warned: https://www.tiktok.com/@lizthedeveloper/video/72835773820264...
I remember a couple of comments here on HN when the hype began about how some dude thought he had figured out how to actually make an AGI - can't find it now, but it was something about having multiple AIs with different personalities discoursing with a shared memory - and now it seems to be happening.
Couple this with access to Linux containers that can be spawned on demand, and we are in for a wild ride!
> If the speed of improvement stays anywhere near the level it was the last two years, then over the next two decades, it will lead to massive changes in how we work and which skills are valuable.
That's a big assumption to make. You can't assume that the rate of improvement will stay the same, especially over a period of 2 decades, which is a very long time. Every advance in technology hits diminishing returns at some point.
Technological progress seems to be accelerating rather than diminishing, to me.
Computers are a great example: they have been getting exponentially more capable over the last decades.
In terms of performance (memory, speed, bandwidth) and in terms of impact. First we had calculators, then we had desktop applications, then the Internet, and now we have AI.
And AI will help us get to the next stage even faster.
A lot of the progress in the last 3-4 years was predictable from GPT-2 and especially GPT-3 onwards - combining instruction following and reinforcement learning with scaling GPT. With research being more closed, this isn't so true anymore. The mp3 case was predictable in 2020 - some early twitter GIFs showed vaguely similar stuff. Can you predict what will happen in 2026/7 though, with multimodal tech?
I simply don't see it as being the same today. The obvious element of scaling, or techniques that imply a useful overlap, isn't there. Whereas before, researchers brought excellent and groundbreaking performance on different benchmarks and areas together as they worked on GPT-3, since 2020, except for instruction following, less has been predictable.
Multimodal could change everything (things like the ScienceQA paper suggest so), but it also might not shift benchmarks. It's just not so clear that the future is as predictable, or will move faster, than the last few years. I do have my own beliefs, similar to Yann LeCun's, about what architecture or rather infrastructure makes most sense intuitively going forward, and there isn't really the openness we used to have from top labs to know if they are going these ways or not. So you are absolutely right that it's hard to imagine where we will be in 20 years, but in a strange way, because it is much less clear than it was in 2020 where we will be 3 years out, I would say progress is much less guaranteed than many feel it is...
I was also thinking about how quickly AI may progress and am curious for your or other people's thoughts. When estimating AI progress, estimating orders of magnitude sounds like the most plausible way to do it, just like Moore's law has guessed the magnitude correctly for years. For AI, it is known that performance increases linearly when the model size increases exponentially. Funding currently increases exponentially, meaning that performance will increase linearly. So, AI performance will keep increasing linearly as long as funding keeps increasing exponentially. On top of this, algorithms may be made more efficient, which may occasionally yield an order-of-magnitude improvement. Does this reasoning make sense? I think it does, but I could be completely wrong.
You can check my post history to see how unpopular this point of view is, but the big "reveal" that will come up is as follows:
The way that LLMs and humans "think" is inherently different. Giving an LLM a test designed for humans is akin to giving a camera a 'drawing test.'
A camera can make a better narrow final output than a human, but it cannot do the subordinate tasks that a human illustrator could, like changing shadings, line width, etc.
An LLM can answer really well on tests, but it often fails at subordinate tasks like 'applying symbolic reasoning to unfamiliar situations.'
Eventually the thinking styles may converge in a way that makes the LLMs practically more capable than humans on those subordinate tasks, but we are not there yet.
Most of the improvements apparently come from training larger models with more data. Which is part of the problem mentioned in the article - the probability that the model just memorizes the answers to the tests is greatly increased.
AI is getting subjectively better, and we need better tests to figure out if this improvement is objectively significant or not.
> Most of the improvements apparently come from training larger models with more data.
OpenAI is reportedly losing 4 cents per query. With a thousandfold increase in model size, and assuming cost scales linearly, that's a problem - a 4-cent loss per query becomes roughly a $40 loss per query. Training time is going to go up too. Moore's law isn't going to help any more. Algorithmic improvements may help... if any significant ones can be found.
Training a model on more data improves generalization not memorization.
To store more information in the same number of parameters requires the commonality between examples to be encoded.
In contrast, training on less data, especially if it is repeated, lets the network learn to provide good answers for that limited set without generalizing - i.e., memorizing.
——
It’s the same as with people. The more variations people see of something, the more likely they intuit the underlying pattern.
The fewer examples, the more likely they just pattern match.
> It’s the same as with people. The more variations people see of something, the more likely they intuit the underlying pattern.
> The fewer examples, the more likely they just pattern match.
A kid who uses a calculator and just fills in the answer to every question will see a lot more examples than a kid that learned by starting from simple concepts and understanding each step. But the kid who focused on learning concepts and saw way fewer problems will obviously have a better understanding here.
So no, you are clearly wrong here; humans don't learn that way at all. These models learn that way, you are right on that, but humans don't.
And since the calculator itself already has a general understanding, it would seem completely counterproductive to start training a computer or child by first giving them a machine that has already solved the problem.
Also, for what it’s worth, I am speaking from many years experience not just training models but creating the algorithms that train them.
Replace "uses a calculator" with "looks through solved problems" - same thing. Not sure what you don't understand. Humans don't build understanding by seeing a lot of solved examples.
To make a human understand, we need to explain how things work to them. You don't just show examples. A human who is just shown a lot of examples won't understand much at all, even if he tries to replicate them.
> Also, for what it’s worth, I am speaking from many years experience not just training models but creating the algorithms that train them.
Humans learn vast amounts of information from examples.
They learn their first words, how to walk, what a cat looks like from many perspectives, how to parse a visual scene, how to parse the spoken word, interpret facial expressions and body language, how different objects move, how different creatures behave, different materials feel, what things cause pain, what things taste like and how they make them feel, how to get what they want, how to climb, how not to fall, all by trial & example. On and on.
And yes, as we get older we get better and better at learning 2nd hand from others verbally, and when people have the time to show us something, or with tools other people already invented.
Like how a post-trained model picks up on something when we explain it via a prompt.
But that is not the kind of training being done by models at this stage. And yet they are learning concepts (pre-prompt) that, as you point out, you & I had to have explained to us.
> Like how a model picks up on when we explain something to it after it has been trained.
Models don't learn by you telling them something; the model doesn't update itself. A human updates their model when you explain how something works to them, that is the main way we teach humans. Models don't update themselves when we explain how something works to them - that isn't how we train these models - so the model isn't learning, it's just evaluating. It would be great if we could train models that way, but we can't.
> Humans learn vast amounts of information from examples.
Yes, but to understand things in school, those examples come with an explanation of what happens. That explanation is critical.
For example, a human can learn to perform legal chess moves in minutes. You tell them the rules each piece has to follow and then they will make legal moves in almost every case. You don't do it by showing them millions of chess boards and moves, all you have to do is explain the rules and the human then knows how to play chess. We can't teach AI models that way, this makes human learning and machine learning fundamentally different still.
And you can see how teaching rules creates a more robust understanding than just showing millions of examples.
> you explain how something works to them, that is the main way we teach humans
I am curious who taught you to recognize sounds, before you understood language, or how to interpret visual phenomena, before you were capable of following someone’s directions.
Or recognize words independent of accent, speed, pitch, or cadence. Or even what a word was.
Humans start out learning to interpret vast amounts of sensory information, and to predict the results of their physical motor movements, from a constant stream of examples.
Over time they learn the ability to absorb information indirectly from others too.
This is no different from models, except that, it turns out, they can learn more things, at a higher degree of abstraction, just from examples, than we can.
And work on their indirect learning (i.e. long-term retention of information we give them via prompts) is just beginning.
But even as adults, our primary learning mode is experience - the example situations we encounter non-stop as we navigate life.
Even when people explain things, we generalize a great deal of nuance and related implications beyond what is said.
“Show, don’t tell” isn’t common advice for no reason. We were born example generalizers.
Then we learn to incorporate indirect information.
You are right, but I think it is really important to have this difference in learning in mind, because not being able to learn rules during training is the main weakness in these models currently. Understanding that weakness and how that makes their reasoning different from humans is key both to using these models and for any work on improving them.
For example, you shouldn't expect it to be able to make valid chess moves reliably; that requires reading and understanding rules, which it can't do during training. It can get some understanding during evaluation, but we really want to be able to encode that understanding into the model itself rather than having to supply it at eval time.
There is a distinction between reasoning skills learned inductively (generalizing from examples), and reasoning learned deductively (via compact symbols or other structures).
The former is better at recognition of complex patterns, but can incorporate some basic deduction steps.
But explicit deduction, once it has been learned, is a far more efficient method of reasoning, and opens up our minds to vast quantities of indirect information we would never have the time or resources to experience directly.
Given how well models can do at the former, it’s going to be extremely interesting to see how quickly they exceed us at the latter - as algorithms for longer chains of processing, internal “whiteboarding” as a working memory tool for consistent reasoning over many steps and many facts, and long term retention of prompt dialogs, get developed!
I pretty much want the LLM to be great at memorizing things. That's what I'm not great at.
If it had perfect recall I would be so thrilled.
And just because it's memorized the data--as all intelligences would need to do to spit data out--doesn't mean it can't still do useful operations on the data, or explain it in different words, or whatever a human might do with it.
Do we? I use gpt-4 daily and it matters not to me what the source of the "intelligence" is. It's subjective what "intelligence" even means. It's subjective how the brain works. Almost by definition AI is "things that can't be objectively measured".
It'd be a bit faster to get up and running with ChatGPT. With the AI, you'd have to phrase the instruction & copy the output into a file. For search, you have to do both those things and learn a UI that wasn't built to taste.
Do you (or anyone) know of any products that allow for iterating on the generated output through further chatting with the ai? What I mean, is that each subsequent prompt here either generated a new whole output, or new chunks to add to the output. Ideally, whether generating code or prose, I’d want to keep prompting about the generated output and the AI further modifies the existing output until it’s refined to the degree I want it.
Or is that effectively what Copilot/cursor do and I’m just a bad operator?
> Do you (or anyone) know of any products that allow for iterating on the generated output through further chatting with the ai? What I mean, is that each subsequent prompt here either generated a new whole output, or new chunks to add to the output. Ideally, whether generating code or prose, I’d want to keep prompting about the generated output and the AI further modifies the existing output until it’s refined to the degree I want it.
No problem, it was a fun morning exercise for me :)
Copilot, at least from what little I did in vscode, isn't as powerful as this. I think there's a GPT4 mode for it that I haven't played with that'd be a lot closer to this.
I used gpt4 to write a script that I can ssh from my iPhone to a m1 that downloads the mp3 from a yt url on my iPhone clipboard. The only thing I am missing is automating the sync button when the iPhone is on the same home wifi to add the mp3 to the music app.
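For reference, a minimal sketch of what the download step of such a script might look like in Python (my own sketch, not the poster's actual script; it assumes yt-dlp and ffmpeg are installed on the Mac, the script name is hypothetical, and the ssh/clipboard plumbing is specific to that setup and not shown):

    import subprocess
    import sys

    # Usage: python fetch_mp3.py <video-url> [output-dir]
    url = sys.argv[1]
    out_dir = sys.argv[2] if len(sys.argv) > 2 else "."

    subprocess.run(
        [
            "yt-dlp",
            "-x",                     # download/extract audio only
            "--audio-format", "mp3",  # convert to mp3 (uses ffmpeg)
            "-o", f"{out_dir}/%(title)s.%(ext)s",
            url,
        ],
        check=True,
    )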
> Two years ago it didn't even dawn on me that this would be my way of writing software in the near future
So you were ignorant two years ago; GitHub Copilot was already available to users back then. The only big new thing in the past two years was GPT-4, and nothing suggests anything similar will come in the next two years. There are no big new things on the horizon; we knew for quite a while that GPT-4 was coming, but there isn't anything like that this time.
But when Copilot came out, I was indeed ignorant! I remember when a friend showed it to me for the first time. I was like "Yeah, it outputs almost correct boilerplate code for you. But thankfully my coding is such that I don't have to write boilerplate". I didn't expect it to be able to write fully functional tools and understand them well enough to actually write pretty nice code!
Regarding "there isn't anything like that this time": Quite the opposite! We have not figured out where using larger models and throwing more data at them will level off! This could go on for quite a while. With FSD 12, Tesla is already testing self-driving with a single large neural net, without any glue code. I am super curious how that will turn out.
Well, my point is that you perceive progress to be fast since you went from not understanding what existed to later getting in on it. That doesn't mean progress was that fast, it means that you just discovered a new domain.
Trying to extrapolate actual progress is bad in itself, but trying to extrapolate your perceived progress is even worse.
Yeah, you have hit the nail on the head here. A lot was predictable from seeing that GPT-2 could reasonably stay within language and generate early coherent structures; that, coming at the same time as instruction following with the T5 stuff and the widespread use of embeddings from BERT, told us this direction was likely. It's just that for many people this came to awareness in 2021/22 rather than during the 2018-2020 ramp-up the field/hobbyists experienced.
Those are image/voice generation, the topic is about potential replacement of knowledge workers such as coders. The discussion about image/voice generation is a very different topic since nobody thinks those are moving towards AGI and nobody argued they were "conscious" etc.