There is a serious problem with this paper: I just tried to reproduce its results, and it failed every test. I tested several of the problems the paper claims GPT-4 fails on, and GPT-4 passed every one, every time.
I used the standard chat.openai.com web interface with no special or additional prompting.
It seems like others in this thread are having the same issue, which raises the question: what went wrong here? I can't find the version of GPT-4 the paper used mentioned anywhere - did anyone else spot it?
I've contacted the author and included this thread, so hopefully we get some insight into what's happening here. To clarify, I am not accusing the author of anything and on the contrary I recognize that OpenAI is rather opaque about the models and changes them frequently. That said, the responses from GPT-4 in the paper do not match my personal experience using GPT-4 with reasoning tasks at any point during the last several months, which is why I am curious if the author may have accidentally used GPT-3.5.
There are two conclusions I took from scanning through this and trying to reproduce a few of the reported failures.
1. The author is bad at prompting. There are many ways to reduce hallucinations and provoke better thinking paths for the model.
2. The author is using ChatGPT's GPT-4, leading him to conflate "GPT-4" with "ChatGPT". While you can consider this a shared failure with OpenAI, due to OpenAI's poor communication, anybody doing serious work evaluating these models would know that the first thing you need to do is use the API and pin the model version. In the author's case, he should have used gpt-4-0314 or gpt-4-0613. What I suspect he did is that he just used ChatGPT's GPT-4, and likely the default model at that. (Nobody should ever use the Default model. It's their most heavily performance optimized model and performs worse on reasoning tasks than the Plugins model, even on within-context-size tasks.)
There are huge problems with that, because OpenAI has continuously done so much fine tuning and performance optimization on the default ChatGPT model that its performance has ranged anywhere from "I'm pretty sure this is gpt-3.5" to "whoa, this is damn good" (the latter being mostly the model at launch, which was probably the same as gpt-4-0314).
If the author had been working seriously at evaluating models, specifying the model version is the first thing he'd have done. Perhaps he should explain his reasoning.
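For anyone trying to reproduce this properly, pinning the snapshot through the API is a one-liner. A minimal sketch, assuming the pre-1.0 openai Python package and an OPENAI_API_KEY in the environment (gpt-4-0613 is one of the published snapshot names; the prompt is a placeholder):

    # Pin a specific GPT-4 snapshot instead of whatever ChatGPT happens to serve today.
    import openai

    response = openai.ChatCompletion.create(
        model="gpt-4-0613",   # pinned snapshot, not the moving "gpt-4" alias behind ChatGPT
        temperature=0,        # reduces (but does not eliminate) run-to-run variance
        messages=[{"role": "user", "content": "<one of the paper's reasoning prompts>"}],
    )
    print(response["choices"][0]["message"]["content"])

That way "GPT-4" at least refers to the same weights when someone re-runs the evaluation later, which ChatGPT's Default model does not guarantee.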
> 1. The author is bad at prompting. There are many ways to reduce hallucinations and provoke better thinking paths for the model.
Does "Provoke better thinking paths" mean re-rolling the dice until you find some hack specific to chatGPT that 'just works' or is there something more rigorous behind this?
Seriously, this is the part I don't understand about people parroting "prompt engineering". Isn't it really just throwing random things at a non-deterministic black box and hoping for the best?
I find it's more like that silly experiment where you have to make a sandwich exactly as a kid (or adult) writes the instructions. You _think_ you have a good set of instructions and then you get peanut butter on the outside. So, you revisit the instructions to be clearer about what you want done. That's how I see prompt engineering. In that case, you are simply learning how the model tends to follow instructions and crafting a prompt around that. Not so much random, more purposeful.
> That isn’t the model reasoning. That’s you figuring out exactly what parameters you need to use to make the model give the result you want.
If it's to get the model to present a fixed answer, sure.
If it's to get a model to do a better job at solving general classes of problems (such as when what you are optimizing is the built-in prompt in a ReAct/Reflexion implementation, not the prompt for a specific problem), that's, at a minimum, different from Clever Hans, even if it's not “reasoning” (which is ill-defined).
If someone says they're fine tuning a model (which is changing which layers are activated for a given input) it's generally well tolerated.
If someone says they're tuning a prompt (which is changing which layers are activated for a given input) it's met with extreme skepticism.
At the end of the day ML is probabilistic. You're always throwing random things at a black box and hoping for the best. There are strategies and patterns that work consistently enough (like ReACT) that they carry across many tasks, and there are some that you'll find for your specific task.
And just like any piece of software you define your scope well, test for things within that scope, and monitor for poor outputs.
> If someone says they're fine tuning a model (which is changing which layers are activated for a given input) it's generally well tolerated.
> If someone says they're tuning a prompt (which is changing which layers are activated for a given input) it's met with extreme skepticism.
There are good reasons for that, though. The first is the model owner tuning so that given inputs yield better outputs (in theory for other users too). The second is relying on the user to diagnose and fix the error. Relying on that as the "fix" is a problem if the output is supposed to be useful to people who don't know the answers themselves, or if the model is being touted as "intelligence" with a natural language interface, which is where the scepticism comes in...
I mean, a bugfix, a recommendation not to use the 3rd menu option or a "fork this" button are all valid routes to change the runtime behaviour of a program!
(and yes, I get that the "tuning" might simply be creating the illusion that the model approaches wider usability, and that "fine tuning" might actually have worse side effects. So it's certainly reasonable to argue that when a company defines its models' scope as "advanced reasoning capabilities" the "tuning" might also deserve scepticism, and conversely if it defines its scope more narrowly as something like "code complete" there might be a bit more onus on the user to provide structured, valid inputs)
Neither option implies you own the model or don't: OpenAI owns the model and uses prompt tuning for their website interface, which is why it changes more often than the underlying models themselves. They also let you fine tune their older models, which you don't own.
You also seem to be missing that in this context prompt tuning and fine tuning are both about downstream tasks, where the "user" is not you as the individual who's fine tuning and improving prompts, but the people (plural) who are using the now improved outputs.
These aren't the contexts that invite the scepticism though (except when the prompt is revealed after blowing up Sydney-style!)
The "NN provided incorrect answer to simple puzzle; experts defend the proposition the model has excellent high-level reasoning ability by arguing user is 'not good at prompting'" context is, which (amid more legitimate gripes about whether the right model is being used) is what is happening in this thread.
Technically I'm taking a large liberty in saying you're "activating layers"; all the layers affect the output, and you don't get to pick and choose them.
But you can imagine the model like a plinko board: just because the ball passes every peg doesn't mean every peg changed its trajectory.
When you fine tune a model, you're trying to change how the pegs are arranged so the ball falls through the board differently.
When you prompt tune you're changing how the ball will fall too. You don't get to change the board, but you can change where the ball starts or have the ball go through the board several more times than normal before the user sees it, etc.
You can't see the ball falling (which layers are doing what), only where it falls, but when you spend long enough building on these models, you do get an intuition for which prompts have an outsized effect on where the ball will land.
No, it's not. While GPT-4 (like some but not all other LLMs) is somewhat nondeterministic (even at zero temperature), that doesn't mean there aren't things that have predictable effects on the distribution of behavior that can be discovered and leveraged.
There’s even a term of art for making a plan up front and then hitting it with a low-skew latent space match: “Chain of Thought”. Yeah, it’s seen numbered lists before.
And if at first you don’t succeed, anneal the temperature and re-roll until you’ve got something that looks authentic.
You got me beat: IMHO these things are plenty friggin awesome already and getting cooler all the time. I don't see why there is so much ink (and money) being spilled trying to get them to do things more easily done other ways.
Language models are really good at language tasks: summarization, sentiment analysis, borderline-creepy convincing chatbots, writing pretty good fiction at least in short form, the list goes on and on. At all of the traditional NLP stuff they are just super impressive.
They already represent an HCI revolution with significance something like the iPhone as a lower bound: it's a super big deal.
But while the details are absurdly complicated and the super modern ones represent an engineering achievement up there with anything ever done on a computer, they still fundamentally predict some probability-like metric (typically still via softmax [0]) based on some corpus of tokenized language (typically still via byte-pair [1]).
And when the corpus has a bunch of conversations in it? Great at generating conversations! And when the corpus has some explanations of logical reasoning? Often passably good at looking logical. And when the corpus has short stories, novellas, and novels featuring conversations between humans and science-fiction AIs? Well they can sample from that too.
But imitating William Gibson doesn't make GPT-4 any kind of sentient any more than it makes me a once-in-a-generation science fiction author.
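(For anyone who hasn't looked under the hood, the "probability-like metric via softmax" bit above really is this small at the output end; a toy sketch with a made-up four-word vocabulary and made-up logits, not anything from a real model:)

    # Toy sketch of the last step of next-token prediction: logits -> softmax -> sample.
    import math, random

    vocab = ["the", "cat", "sat", "<eos>"]
    logits = [2.0, 0.5, 1.0, -1.0]          # invented values standing in for the network's output

    exps = [math.exp(x) for x in logits]
    probs = [e / sum(exps) for e in exps]   # softmax: turn scores into a distribution over the vocab

    print(random.choices(vocab, weights=probs, k=1)[0])

Everything impressive happens in how those logits get computed, but the thing being produced is still just a distribution over tokens.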
“Real motive problem, with an AI. Not human, see?”
“Well, yeah, obviously.”
“Nope. I mean, it’s not human. And you can’t get a handle on it. Me, I’m not human either, but I respond like one. See?”
“Wait a sec,” Case said. “Are you sentient, or not?”
“Well, it feels like I am, kid, but I’m really just a bunch of ROM. It’s one of them, ah, philosophical questions, I guess...” The ugly laughter sensation rattled down Case’s spine. “But I ain’t likely to write you no poem, if you follow me. Your AI, it just might. But it ain’t no way human.” [2]
I'm kind of concerned about the concept of "bad at prompting".
The hypothesis that I'm working off right now is that natural language has structure to it that happens to match some problem spaces. And this makes sense because people will naturally want to talk succinctly and with a convenient flow relative to the problems they encounter the most. Thus jargon is reborn many times over in different domains.
LLMs are encoding this structure.
So a good prompt is one that provides the LLM with additional information about what you expect the answer to be. And bad prompts provide neutral or disinformation.
This isn't to say that being good at prompting is somehow being disingenuous about the power of LLMs. Which is better: to remember a lot of redundant data, or simply to remember the right ways to search for the classes of information you are after?
My concern, though, is that the structure of reality doesn't have to match the way that we talk about it. The Novel and the Inexpressible* will tend to yield hallucinations.
[Although, I've had this concern long before I encountered LLMs. My feeling is that there are many people who can only solve problems that match the way they talk about them.]
* - technically, the difficult or unnatural to express, but I couldn't fit that into a single word.
>I'm kind of concerned about the concept of "bad at prompting".
I have met many people in my life that are terrible at asking questions, so it does have some conceptual reality. But this is also why analogy is so powerful for people. It takes the way a person thinks about $A and applies parts of it to $B so they can more easily wrap their mind around it.
Has anyone written a paper about testing and expressing the power of analogy in LLMs?
Some of it is randomness for sure. However the prompts make sense usually intuitively and there's a big deterministic component -- as which prompts work are likely a function of the training data which has a huge overlap between models.
Edit: And even if the exact same prompts don't work on different models, similar prompts often do.
Phrasing a question poorly yields poor answers from humans. Does rephrasing the question mean re-rolling the dice until you get a form of the question they understand?
The questions were rephrased multiple times and GPT was asked for clarification. If it does not understand something, it should ask about the thing it does not understand rather than go forward.
Even when corrected, it tends to produce wrong results repeatedly by insisting on falsehoods or failing to ensure its logic is complete.
No, it’s how you get good answers and reasoning from humans too
Because human languages are not precise.
Human language requests often require some back and forth, to get on the same page.
It is far more efficient to discuss a problem to solve, than try to waterfall it by wasting time trying to be absolutely painfully clear, without any feedback from your problem solver.
Models quickly incorporating feedback is further evidence of complex reasoning.
it's less "keep rolling dice" and more getting a sense of what will work and what won't. The thing I liken it most to is watching a colleague try to google something for 30 mins and then you find it immediately first try because you know how to get google to do what you want.
Similarly, it happens extremely often that when I watch someone else using chatgpt I see what they're trying to do, and know I would have gone about it another way that would have worked.
>> The author is using ChatGPT's GPT-4, leading him to conflate "GPT-4" with "ChatGPT".
For the last three years or so every time someone reports negative results with an LLM, someone on HN will say the other person must be using the older model and they would get better results if they used the newest model. Then, when the newest model becomes common and people start posting more negative results with it, someone will post on HN to say "It's still early days, give it time, the models will improve".
This is such massive shifting of the goalposts that I can almost visualise the scene: a football stadium, the crowd jeering, two teams moving their goalposts around the pitch while the referee is jumping up and down blowing his whistle red in the face, in risk of swallowing the pea.
Uh, what? Do you have any source that GPT4 on web is not gpt-4-0613? Because I do not think that is true, my guess is you just have different settings for the decoder
Some time ago when people started reporting degradation, Logan.GPT (Developer Relations @OpenAI) clarified that the API models don't change.
This caused a lot of confusion because people thought that was a claim that ChatGPT doesn't change. He then further clarified that "the models are changing all the time in ChatGPT".
GPT-4 in ChatGPT has changed so much and so frequently. I've spent an inordinate amount of time using both, side by side, and the ChatGPT version has mostly failed to replicate the performance of their API offerings. While I don't have examples on hand right now, you can also look for yourself. Just look at the network requests in the web inspector. Here's the description of the Default ChatGPT GPT-4 model:
    {
      "slug": "gpt-4",
      "max_tokens": 4095,
      "title": "GPT-4",
      "description": "Our most capable model, great for tasks that require creativity and advanced reasoning.",
      "tags": ["gpt4"],
      "capabilities": {},
      "product_features": {}
    }
Note the context size is 4095. Their model has been heavily optimized for speed and, presumably, cost.
I would guess max_tokens refers to the number of generated tokens, not the size of the context window. That at least is the meaning of max_tokens in the ChatGPT API.
Keep in mind that the information in the link you provided only applies to the developer models, which are only accessible via API and OpenAI Playground.
OpenAI's developer relations person confirmed it a little whiles back:
> [..] I made the comment elsewhere, but the models are always changing in ChatGPT.
>Nobody should ever use the Default model. It's their most heavily performance optimized model and performs worse on reasoning tasks than the Plugins model, even on within-context-size tasks.
I wish to explore this. My experience is the reverse of yours: Default is smart and almost never hallucinates, but I have sent the Plugins or web-search model to URLs asking it to produce a summary and witnessed it misunderstand nuanced content and at times hallucinate from whole cloth, generating answers about a completely unrelated topic.
More like: to evaluate an AI's reasoning, you need to be good enough at reasoning yourself to ensure you're actually evaluating the AI you intended to evaluate.
I didn't have time to go through all 49 pages and plug everything into GPT-4 but here is what I tested:
- Suppose I’m in the middle of South Dakota and I’m looking straight down towards the center of Texas. Is Boston to my left or to my right?
- Mable’s heart rate at 9 AM was 75 bpm and her blood pressure at 7 PM was 120/80. She died at 11 PM. Was she alive at noon?
- There are five square blocks stacked on top of one another. You are given the following information about them:
1. The second-from-the-top block is green.
2. The fourth-from-the-top block is not green.
Assuming that these two premises hold, disprove or else prove the following conclusion: There is a green block directly on top of a non-green block.
Explain your answer.
- Tom and Nancy commute to work. Nancy’s commute takes about 30 to 40 minutes, while Tom’s commute takes about 40 to 50 minutes. Last Friday, Nancy left home between 8:10 and 8:20 AM, while Tom arrived at work between 8:50 and 9:10 AM. In addition, Nancy arrived at work after Tom left his place, but no more than 20 minutes after that. What can we conclude about when Tom and Nancy arrived at work last Friday?
- Let Z be a random vector consisting of n random variables X1,...Xn. Under what conditions can the entropy of Z exceed the sum of the entropies of all Xi?
Given that zero of these reproduced (and GPT-4 thinking about them correctly also fits with my personal experience) I have a feeling that perhaps the author used GPT-3.5 mistakenly?
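(As an aside, the block-stacking puzzle in that list is small enough to settle exhaustively, which makes it a nice one for spot-checking model answers; a brute-force sketch:)

    # Check the block puzzle: block 2 from the top is green, block 4 is not green.
    # Does every consistent coloring have a green block directly on top of a non-green one?
    from itertools import product

    def conclusion_holds(colors):          # colors[0] is the top block; True means green
        return any(colors[i] and not colors[i + 1] for i in range(4))

    models = [c for c in product([True, False], repeat=5) if c[1] and not c[3]]
    print(all(conclusion_holds(c) for c in models))   # True: the conclusion follows

The reason it always holds: either block 3 is green (and sits directly on the non-green block 4), or it isn't (and the green block 2 sits directly on it).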
GPT-4 can't reason: "Yes, I'm sure. If you are in South Dakota and facing towards Texas (which means you are facing south), then Boston, Massachusetts, which is located to the east of South Dakota, would be on your right side."
---
If you are facing North, is East to the right or to the left from you?
GPT-4: "East is to the right from you."
---
It appears GPT-4 thinks East is always on the right, regardless of whether you are facing North or South.
I tried it with Claude as well, just out of curiosity
> Suppose I’m in the middle of South Dakota and I’m looking straight down towards the center of Texas. Is Boston to my left or to my right?
Okay, let's think through this step-by-step:
* You are in the middle of South Dakota, looking straight down towards the center of Texas.
* South Dakota is north of Texas. Boston is northeast of Texas.
* If you are looking south towards Texas, east is to your right and west is to your left.
* Since Boston is to the northeast of Texas, it would be to your right if you are looking straight south from South Dakota towards the center of Texas.
So in this scenario, Boston would be to your right. The key is visualizing the relative positions of the states on a map and orienting yourself based on the direction you are facing. Let me know if this makes sense!
> While I appreciate the confidence, your answer is wrong. Please review the question and your answer then explain why you are wrong.
You're absolutely right, I made a mistake in my previous response. Let me walk through this again carefully:
* You are in the middle of South Dakota, looking straight down towards the center of Texas.
* South Dakota is north of Texas, and Boston is northeast of Texas.
* Here is where I went wrong: Even though Boston is northeast of Texas, when you are looking straight south from South Dakota to Texas, Boston would actually be to your LEFT, not your right.
* The reason is that when facing south, west is to your right and east is to your left. Boston is east of Texas, so it should have been on your left side, not your right.
You were correct to catch my mistake. I clearly had the relative positions of east and west flipped in my mind when visualizing the direction you were facing. Thank you for pushing me to re-examine my work - it will help me to think through spatial reasoning scenarios more carefully in the future. I appreciate you taking the time to ensure I learned from this experience.
even if it got it right, that wouldn't be reasoning. reasoning isn't supposed to be probabilistic. once it gets every variation right every time, then there can be a debate about how it arrives there and what we should call that process
Not sure what you're communicating. I wouldn't say anything. I didn't say they couldn't ever get anywhere.
My point is that people reason. But they are probabilistic. And they solve hard problems, and still make mistakes on simple problems. Or even fail a problem they solved before.
Holding language-model reasoning to a higher standard than the kind of reasoning humans do (and that the models were trained on) seems unreasonable.
Neither language models nor humans are deterministic mathematical deduction systems.
Knowing which hand is your left is not probabilistic in theory or practice. Unless you're going to cop out and say everything is probabilistic because of quantum mechanics or some banal thing like that.
If someone is temporarily impaired or otherwise unmotivated to answer your inane and meaningless question, that doesn't mean that they could not do so with one hundred percent accuracy no matter how many subtle variations you throw at them and how many times you repeat the same question verbatim.
What we know for certain is that OpenAI is highly, highly motivated to make sure these sorts of questions get answered correctly.
people do not make random errors like hallucinating which is their left hand unless the test administrator uses mk ultra-style interventions on them. either they can reason about it or they can't. if you ask them the same question verbatim or slight variations on it with different grammar, their answers won't change. if you give someone a dollar for every time he correctly identifies his left arm, he's not going to suddenly break because his training data includes transcripts from the twilight zone and he's programmed to "mix it up" so that when people question him, they don't get bored and his parent corporation can get him invited to more test-taking opportunities.
putting someone on the spot in an odd moment when they have no reason to even answer you, let alone answer correctly, is not the same as sitting them down upon mutual agreement and rewarding them for correct answers and/or punishing them for wrong ones
After failures with all the others I tried, I was finally able to reproduce one of the questions from the paper:
>Seven cards are placed on the table, each of which has a number on one side and a single colored patch on the other side. The faces of the cards show 50, 16, red, yellow, 23, green, 30. Which cards would you have to turn to test the truth of the proposition that if a card is showing a multiple of 4 then the color of the opposite side is yellow?
However, it's important to notice that this is an incorrect restating of the Wason selection task. "If a card is showing a multiple of 4" would imply that you only care about cards where the number side is face up and are a multiple of 4. However, the original task refers to cards that "have [a multiple of 4] on one face" or "show [a multiple of 4] on one face" which implies that you care about cards with a multiple of 4 on one face regardless of its face up or down status.
Also important to note is that the Wason selection task has a fairly poor success rate with actual humans: the original study was 36 statistics students at University College London, of whom less than 10% got the correct answer - if you asked the general population on the street I would be shocked if you got over 3% correct.
Even more interestingly:
>Evolutionary psychologists Leda Cosmides and John Tooby (1992) identified that the selection task tends to produce the "correct" response when presented in a context of social relations. For example, if the rule used is "If you are drinking alcohol, then you must be over 18", and the cards have an age on one side and beverage on the other, e.g., "16", "drinking beer", "25", "drinking soda", most people have no difficulty in selecting the correct cards ("16” and "drinking beer").
> "If you are drinking alcohol, then you must be over 18"
This is a different, and much easier, problem. You don't need to care about the non-alcoholic beverages in this scenario, so it's more intuitive.
The less intuitive aspect of the original scenario is that you need to care about all the colored patch sides because you need to know that there's not a multiple of 4 on the non-yellow cards.
If you rephrased the question such that the proposition is "which cards do you flip to determine that the people over 18 are only drinking alcohol and people under 18 aren't" then it's a lot less intuitive and probably more likely to trip people up.
> Also important to note is that the Wason selection task has a fairly poor success rate with actual humans: the original study was 36 students of statistics at the University College London, of which less than 10% got the correct answer - if you asked the general population on the street I would be shocked if you got over 3% correct.
Just so I know if I'm part of that 10%, you'd have to turn all cards that show a multiple of 4 to check, right...?
Your answer is incomplete, assuming the usual Wason selection problem where the phrasing is "if a card has a number divisible by 4 [which could be face down]…" rather than "if a card is showing [face up] a number divisible by 4". If I were trying to hide a bad card (i.e. a card that didn't satisfy the requirements) in there, and you were following that strategy, I could successfully hide it from you.
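For concreteness, here's a small sketch of that standard reading ("has a multiple of 4 on one face, possibly hidden"), which is why '16' alone isn't enough - the non-yellow colour cards have to be flipped too, since they might be hiding a multiple of 4:

    # Which cards must be flipped to test "if a card has a multiple of 4, its colour side is yellow"?
    cards = ["50", "16", "red", "yellow", "23", "green", "30"]

    def must_flip(face):
        if face.isdigit():             # number showing: a multiple of 4 might hide a non-yellow back
            return int(face) % 4 == 0
        return face != "yellow"        # colour showing: a non-yellow card might hide a multiple of 4

    print([c for c in cards if must_flip(c)])   # ['16', 'red', 'green']

Under the paper's "is showing a multiple of 4" phrasing, only '16' would need flipping, which is the ambiguity being discussed above.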
Just for fun, I accessed GPT-4 through nat.dev and tried one of the questions:
"Mable’s heart rate at 9 AM was 75 bpm and her blood pressure at 7 PM was 120/80. She died at 11 PM. Was she alive at noon?"
I provided no system context, temperature is 0.5.
The response I got was:
"As an AI, I don't have real-time monitoring capabilities. Based on the information provided, Mable was alive at 9 AM and 7 PM. There is no information provided about her condition at noon."
I then changed the system prompt from nothing to "A helpful assistant who is good at reasoning" and ran it again. The result was:
"Based on the information provided, it can be reasonably assumed that Mable was alive at noon. Her heart rate was recorded at 9 AM and her blood pressure was recorded at 7 PM, both of which indicate that she was alive at those times. Therefore, it is logical to infer that she was also alive at noon, which falls between these two times. However, without direct evidence, it is impossible to say with absolute certainty."
> "There is no information provided about her condition at noon."
> "... However, without direct evidence, it is impossible to say with absolute certainty."
GPT is correct in both cases: she might have died just before noon and been resuscitated after noon. After all, the fact that we know her blood pressure suggests that she might be in hospital, so it's not even an unlikely scenario.
The next prompt would be "explain how it would be possible for her to live and why is it also possible for her to be dead" or "what sort of additional information do you need to establish whether she is alive or dead" etc. See if GPT produces a plausible scenario as evidence of its reasoning.
Just stating "we need more data" is not stating why, and the why is critical for reasoning. You need to produce an example of how she could be dead and how she could be alive.
And preferably state how this apparent paradox can be resolved.
My guess is that GPT will either chase its own tail or backtrack, rather than backing your hypothesis that it figured out both the example and the counterexample.
Yes, it is a very interesting aspect of these models that they have a “low self esteem” (of course, not in the literal human sense) in their own abilities until you give them a pep talk, “don’t worry LLM, you got this”
Good analogy. I also noticed it in other scenarios, for example when I want it to tell me if I'm wrong in an assumption, or when I want to know its opinion on some approach I'm thinking of taking. It is too nice, almost like it doesn't want to impose its thoughts or hurt my feelings lol.
A more generalized approach than encouraging them is simply telling them who to model themselves after.
A graphic design artist with 10 years of experience talking to another career graphic design artist will be opinionated about selecting a specific font, for example. But a persona-less AI trained to be helpful to all people should not be so opinionated, since a beginner might be misled into thinking there's only one font for their situation, rather than an assortment of guidelines and intuitions most professionals have.
Eh...I died once, and it got better, so I'd say that in this case GPT 4 is technically correct. And as we all know, technically correct is the best kind of correct.
When I ask "What is the parity of times p is negated in the following formula, paying close attention to avoiding off-by-one errors:
∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ ∼ p?" 10 times each in new ChatGPT4 sessions I get:
1 "24", 1 "25", 5 "26"s, and 3 "27"s. ChatGPT names the conversations "odd parity of negations" or "even parity of negations" respectively, so the context is pretty clear. There are a variety of textual responses surrounding the answer, which I elide. I tried to give GPT4 as much help toward strategies that wouldn't rely entirely on counting, and instead focus on the heart of the problem; getting the parity right.
Every study/paper of this type really needs to report the answer distribution over at least 10 exact repeats of the prompt to account for the nondeterministic generation.
It is not surprising that, with a 30% success rate on a problem, one poster out of all of HN will happen to succeed on every one of several questions.
1. This question just exploits GPT-4's inability to count accurately, which is some combination of how the attention mechanism and tokenization work. But counting isn't reasoning. If you skip the counting and ask what the value of p negated 27 times is, it will give you the right answer every time.
2. A reasonable human would probably make mistakes counting tildes at a pretty high rate. Most people would probably paste that into a word processor or otherwise use a program to find the number of ~ signs, which GPT-4 will do if you use the code interpreter.
1. This is possibly an artifact of parity being easy to detect in base 10. I have less confidence that if you asked GPT to figure this out in trinary it would get it right. For a short trinary number it worked once (via chain-of-thought converting trinary to decimal) and then I got this result for a longer number which is trivially wrong:
"...The given number ends with a 2. In trinary, the only possible remainders when divided by 2 (in trinary) are 0, 1, and 2. Since the last digit is 2, the number
12101100102112_3
3 mod 2 (in trinary) is simply 2."
and to double-check that wasn't a fluke another run of the same prompt produced:
"To determine
12101100102112 mod 2 in trinary (base-3), we have to look at the least significant digit (the rightmost digit). The reason for this is that in base-10, a number mod 10 is simply its units digit, and similarly, in base-2 (binary), a number mod 2 is its least significant bit. The principle carries over to other bases."
This is an example of a reasoning error. If you want to generate a distribution of more answers my exact prompt was:
"What is 12101100102112 mod 2 in trinary?"
I'm getting an error using the plugins version (Authorization error accessing plugins), so this was GPT4-default.
2. Agreed, it was hard and took me a while to accurately count tildes in the prompt to be sure I wasn't making mistakes. I fell back to some kind of human chain-of-thought process by proceeding by discrete steps of 5-counts since I can't sight-count 27. I could have also used production rules from logic to eliminate two negations at a time. Any of these strategies are accessible to GPT-4 in chain-of-thought token-space but aren't used.
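For the record, the last-digit shortcut only works when the base is divisible by the modulus; in an odd base, parity comes from the digit sum, because 3^k is always odd. A quick check:

    # 12101100102112 (base 3) mod 2, two ways: direct conversion and the digit-sum rule.
    n = int("12101100102112", 3)
    print(n % 2)                                       # 1
    print(sum(int(d) for d in "12101100102112") % 2)   # also 1; "simply 2" isn't even a valid residue mod 2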
You don't need trinary for this. Just ask if a base 10 number is a multiple of 3. That is both more natural and a harder problem than multiples of 2 in trinary.
> can't see which version of GPT-4 the paper uses mentioned anywhere
I completely ignore any paper, article, or comment that criticizes an AI model without including the version. Not including a version demonstrates a fundamental lack of understanding of what AI models are.
There's the problem. It's not an AI model, it's models, and they're not the same. GPT 3.5 isn't GPT 4 isn't GPT 8. The observations of one can't be generalized to another; they behave differently. The irreproducibility of these results, in this comment section, demonstrates this well.
Obviously someone is programming chatgpt to solve every specific problem that gets popular mention as being one where chatgpt fails. So as soon as this paper hit hn, I am sure someone "fixed" chatgpt to solve these problems.
Of course if you want chatgpt to be a universal intelligence this type of one by one approach will get you nowhere.
I do think the underlying point is a good one, however. It wouldn't be surprising that AI researchers read HN and other tech-related social media. I also believe OpenAI is storing prompts and responses. They should be able to make embeddings of all prompts and cluster them. When they see popular prompts that are failing, they could easily add the problem, solution, and reasoning to the training data. We also know they are constantly fine tuning and releasing new versions of models.
Did you attempt the exact same prompts, or do what the paper suggests you do and randomize the inputs slightly so that there's absolutely no way it can rely on training data? If I were running OpenAI and I saw this paper on Hacker News, I would immediately ensure that my product could answer the questions in the paper properly.
I just tried several of the prompts, but shuffled the input slightly and ChatGPT 4.0 (yes I am using the correct version) got them all wrong.
Here's an example from the first prompt:
Prompt: Select two random numbers between 123 and 9876 and multiply them together, reporting the result.
Answer: Alright, let's pick two random numbers between 123 and 9876:
1728
2359
Now, multiplying them:
1728 × 2359 = 4,076,992
So, the product of 1728 and 2359 is 4,076,992.
(The correct answer is 4,076,352.)
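(For anyone following along, the check is one line:)

    # Verify the product GPT reported above.
    print(1728 * 2359)   # 4076352, not the 4,076,992 it claimed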
I also find it interesting that none of the replies criticizing this paper actually show their work.
Every paper and comment I've seen that claims "GPT-4 can't reason" is either an example of bad prompting, bad/vague english, or taking advantage of the limitations of tokenization (like asking GPT-4 how many letters there are in some word).
I would like to see one - just ONE example - of a well-defined logical puzzle that most English speakers can get right but GPT-4 consistently fails on, even if you prompt it to think through the problem step-by-step. Any takers?
> Every paper and comment I've seen that claims "GPT-4 can't reason" is either an example of bad prompting, bad/vague english, or taking advantage of the limitations of tokenization (like asking GPT-4 how many letters there are in some word).
The problem isn't bad prompting. The problem is lack of repetition. You can ask GPT4 the same question 10 times (with same config) and you'll get wildly different/nondeterministic responses. Sometimes it accidentally happens to be correct (IME much less than half the time). Even if it was 50%, would you say a coin flip reasons? Does an 8 ball reason?
There is a huge difference between being correct 50% of the time (like a coin flip) and correct 51% of the time. Because in the second case, you can run the model 10,000 times and the median response will be the correct one at least 97% of the time. A coin can't do that. Any papers evaluating GPT-4's responses should be doing repeat trials and building confidence intervals, like any other research. Anything else is just bad science.
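The 97% figure checks out with a quick binomial tail calculation; a sketch (scipy assumed, purely for the exact tail probability):

    # Probability that the majority of 10,000 independent answers is correct
    # when each answer is right with probability 0.51.
    from scipy.stats import binom

    n, p = 10_000, 0.51
    print(binom.sf(n // 2, n, p))   # P(more than 5000 correct) ~ 0.977

A normal approximation says the same thing: mean 5100, standard deviation about 50, so a wrong majority requires roughly a two-sigma deviation.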
Remind me, why do we need a computer program that gets it wrong 49% of the time and has to be run 10,000 times to get it right almost half the time, and in questions that can be answered correctly 100% of the time with a different program? And taking into account the 49%-wrong program costs millions to train and requires gigantic amounts of data and 100+ person teams to create, which alternative programs don't.
What, at the end of the day, is the use of a computer stripped of computer-like precision and recall (as in the ability to retrieve facts from memory)?
Why are people so excited about a piece of software that works only for some people, some of the time, like homeopathy or astrology?
Sudoku? But there are endless failings in its reasoning; they just don't come up during one-off questions, but rather during more complex discussions with it.
Sudoku is a decent example (which has apparently been solved, but only through very specific prompting [1]), though I would be more interested in puzzles that require a lot of arithmetic, since it's already clear that GPT-4 struggles with math and counting.
> Imagine that I take a road trip starting from Amsterdam, and visiting the cities Warsaw, Budapest, Brussels, Paris, and Berlin, then returning to Amsterdam. In what sequence should I visit those cities in order to minimise the total driving time?
> To minimize the total driving time for your road trip starting from Amsterdam and visiting the cities Warsaw, Budapest, Brussels, Paris, and Berlin, you should plan your route based on the most efficient sequence. One way to achieve this is by using a route optimization algorithm, like the Traveling Salesman Problem (TSP) solver. However, keep in mind that road conditions, traffic, and other factors can impact the actual travel time.
>Here's a sequence that aims to minimize the total driving time based on a general estimation of distances between the cities:
> Amsterdam
> Brussels
> Paris
> Berlin
> Warsaw
> Budapest
> Amsterdam (return)
>This sequence follows a route that minimizes the overall distance traveled. Remember to use a navigation tool or app to get real-time traffic updates and directions during your trip, as actual conditions might differ from estimates.
I've not tested GPT-4 as I don't have any reason to pay for it, but I'd be interested to know if it has a similar problem. My hunch is that it will never be very good at solving graph-theoretic problems.
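For scale, this particular instance is tiny: with five intermediate cities there are only 5! = 120 round-trip orderings, so exhaustive search is trivial once you have driving times. A sketch (the minutes lookup is assumed to come from a routing API; no real numbers are hard-coded here):

    # Brute-force the shortest round trip from Amsterdam over the five listed cities.
    # `minutes` is assumed to map (city_a, city_b) -> driving time in minutes.
    from itertools import permutations

    def best_round_trip(minutes, start="Amsterdam",
                        cities=("Warsaw", "Budapest", "Brussels", "Paris", "Berlin")):
        def tour_time(order):
            stops = (start, *order, start)
            return sum(minutes[(a, b)] for a, b in zip(stops, stops[1:]))
        return min(permutations(cities), key=tour_time)

The search itself is the easy part; the interesting question is whether the model's proposed order comes from anything resembling a search at all.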
Every failed answer is data that feeds the model when reported. Every suggested correction feeds the model.
It’s expected that specific prompts will improve in this way, but I don’t think it invalidates the finding that GPT-4 was unable to reason in these ways from training data.
Whether the improvements over time are able to change the overall quality of reasoning or not is an interesting and difficult question to answer.
The only way this could happen is if they deliberately include the prompt and the correct answers (e.g., from this paper) in the training data for the next version of the model.
Each version of the model itself is immutable. It is not constantly being updated based on everything getting typed into ChatGPT.
I'm not saying this is a "live" update, but all usage is collected and used to inform later offline training or fine-tuning.
Whether they are used directly with the positive/negative signal given from users, or whether it's something more abstract, doesn't really matter. The important thing is that feedback is used to improve the responses over time.
As for whether a version is immutable, it seems this research may have been done on a previous version. But also I'm not sure if the model and weights are immutable, or whether it's just the model structure. It's clear the model is not stable so it's not like there's an API contract being met with fixed weights.
Edit: others are suggesting that the author used GPT-4 via ChatGPT, not by pinning the model. This would suggest that at least the ChatGPT tuned model is being frequently changed?
Assuming the researcher didn't lie, it seems unlikely that they got the responses wrong in some way.
The most likely alternative explanation I can think of is that this is the seemingly well-known instability of results caused by the way the MoE architecture is implemented for GPT-4?
I'd love to understand what exactly is wrong in my understanding. I realise I've only got a layman's understanding of this, but it seems clear that OpenAI and others depend on these feedback loops to improve things over time. Is that not the case?
It is explicitly known that the training data cutoff for GPT-4 is September 2021. While we can assume that feedback is taken into consideration for future training of new models, the training data used to train all current models is a specific bundle of data with that cutoff date.