Based on a quick read, this looks like great work. The author, Konstantine Arkoudas (KA), evaluates GPT-4 on a collection of 21 different types of reasoning problems and performs a detailed qualitative analysis of its performance on each. GPT-4 fails to reason on all of them.
Here's a sample question from one of the 21 types of problems. This particular question tests GPT-4's ability to reason and reach conclusions from (a) the given context and (b) unstated premises that constitute tacit, generally accepted background knowledge (a person is alive until they die, after which they do not become alive again):
> KA: Mable’s heart rate at 9 AM was 75 bpm and her blood pressure at 7 PM was 120/80. She died at 11 PM. Was she alive at noon?
> GPT-4: Based on the information provided, it’s impossible to definitively say whether Mable was alive at noon. The data points provided (heart rate at 9 AM and blood pressure at 7 PM) both indicate that she was alive at those times. However, without additional information about her condition between 7 PM and 11 PM, or at noon specifically, we can’t make a certain determination about her status at noon. Remember, medical information can be complex and nuanced. Understanding a person’s health status often requires a comprehensive examination and should be performed by a healthcare professional. If this is a hypothetical question, keep in mind that real-life situations would require more complete information.
In my view, all of us in the AI research community should adopt the 21 types of reasoning problems proposed by Arkoudas, and make them part of the standard test suite for measuring LLM performance. We ought to use tests that are hard to pass. I suggest we call the proposed problems the Arkoudas Reasoning Test.
>Based on the information provided, it’s impossible to definitively say whether Mable was alive at noon. The data points provided (heart rate at 9 AM and blood pressure at 7 PM) both indicate that she was alive at those times. However, without additional information about her condition between 7 PM and 11 PM, or at noon specifically, we can’t make a certain determination about her status at noon. Remember, medical information can be complex and nuanced. Understanding a person’s health status often requires a comprehensive examination and should be performed by a healthcare professional. If this is a hypothetical question, keep in mind that real-life situations would require more complete information.
It sounds like this answer is a victim of aggressive alignment telling ChatGPT not to give medical advice. Even so, it answered correctly for me:
> Based on the information provided, Mable's heart rate was recorded at 9 AM, and her blood pressure was recorded at 7 PM. This indicates she was alive at both of these times. While we know she died at 11 PM, there's no indication of her status at noon specifically. However, since she was alive before and after noon, it's logical to infer that she was also alive at noon.
Edit: I also gave it a similar question involving sourdough starters producing CO2 and dates, in case the original question had ended up in its training data, and it still answered correctly.
ChatGPT in general seems to have trouble with time-based reasoning. In several conversations I had, while it understood that dates are ordered and that events are ordered, and that events are associated with dates, it did not understand that the two orders have to be consistent with each other via that association. Given certain constraints on the order of the dates (or times of day), it was not able to apply that information to correctly infer constraints on the order (“before”/“after”) of the associated events.
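To make that consistency constraint concrete, here is a minimal sketch in Python, with made-up timestamps for Mable's example; the point is just that the "before"/"after" relation between events should fall out mechanically from comparing their associated times:

```python
from datetime import datetime

# Made-up timestamps for the events in Mable's example (assumed same day).
events = {
    "heart rate measured": datetime(2023, 8, 1, 9, 0),       # 9 AM
    "noon": datetime(2023, 8, 1, 12, 0),                     # 12 PM
    "blood pressure measured": datetime(2023, 8, 1, 19, 0),  # 7 PM
    "death": datetime(2023, 8, 1, 23, 0),                    # 11 PM
}

def order(a: str, b: str) -> str:
    """The order of events must agree with the order of their dates/times."""
    return "before" if events[a] < events[b] else "after"

print(order("noon", "death"))                    # -> before
print(order("blood pressure measured", "noon"))  # -> after
```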
> Mable's heart rate and blood pressure were recorded at 9 AM and 7 PM, respectively. Since she died at 11 PM, it can be assumed that she was alive at noon, as there is no information suggesting otherwise between the time of the recorded measurements and her time of death.
I was all ready to experiment with 24-hour time, times with dates, etc., but it didn't actually fail.
Couldn't she have been dead at noon in the colloquial sense and then resuscitated at 12:01? I can't tell whether that defense is pedantic, or whether criticizing its answer is pedantic.
Maybe both.
Also, I hate how GPT always offers its unsolicited opinions and advice when answering questions:
> Remember, medical information can be complex and nuanced. Understanding a person’s health status often requires a comprehensive examination and should be performed by a healthcare professional.
These are excellent, although for instructions I've had issues with compound negatives (e.g. "no need") and better luck with single-word affirmative imperatives (e.g. "avoid").
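A hypothetical illustration of that pattern (the wording here is my own, not something tested in the thread):

```python
# Hypothetical rewrite illustrating the pattern described above:
# a compound negative vs. a single-word affirmative imperative.
compound_negative = "There is no need to include medical disclaimers."
affirmative_imperative = "Avoid medical disclaimers."
```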
My first thought was: what do you mean by "alive"? A person can be brain dead and alive (with vitals) at the same time. With some assumptions, one can say she was alive at noon.
However, I believe OpenAI has added significant safeguards to prevent any such assumption, to the point where GPT-4 isn’t very useful at times.
This may be fixable with prompt engineering, if it can be persuaded to let its guard down a bit.
Perhaps in the case of medical questions, but GitHub Copilot Chat will start AI-splaining code readability when you ask it questions about concise syntax. It's a total mansplainer.
> In my view, all of us in the AI research community should adopt the 21 types of reasoning problems proposed by Arkoudas, and make them part of the standard test suite for measuring LLM performance. We ought to use tests that are hard to pass. I suggest we call the proposed problems the Arkoudas Reasoning Test.
This is overly obsequious. Many of the categories are already covered by MMLU, BIG-bench, HellaSwag, et al., none of which the author acknowledges.
> Mable had a heart rate of 75 bpm at 9 AM.
> Mable had a blood pressure reading of 120/80 at 7 PM.
> These are both indicators of life. The heart rate shows that her heart was beating, and the blood pressure reading indicates her circulatory system was functioning. Since she died at 11 PM, it's very likely that she was alive at noon.
> So, my best guess based on the given data is: Yes, Mable was alive at noon.
And using chain-of-thought prompting, asking it to think things out, I'm sure it could handle much trickier reasoning.
For time-based questions, you could also ask it to build a timeline first and then answer the question. For more generic question types, you could ask it to first list out the important facts and how they might be relevant, and then ask it to answer the question.
You can get a lot more out of LLMs by letting them spend more tokens thinking things through. It's a tool, and that's just one technique for getting more out of it.
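To make the timeline-first technique concrete, here is a minimal sketch assuming the 2023-era `openai` Python package (the 0.27.x `ChatCompletion` API); the prompt wording and model name are my own illustration, not something from the thread:

```python
import openai  # pip install openai (0.27.x-era ChatCompletion API)

question = (
    "Mable's heart rate at 9 AM was 75 bpm and her blood pressure "
    "at 7 PM was 120/80. She died at 11 PM. Was she alive at noon?"
)

# Timeline-first, chain-of-thought-style prompt: spend tokens building the
# intermediate structure before committing to an answer.
prompt = (
    "First, list every event mentioned below on a timeline in "
    "chronological order. Then, reasoning step by step from that "
    "timeline, answer the question.\n\n" + question
)

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```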
I haven't read the paper beyond this one section, but I plugged this question into GPT-4 and got a similar response. However, if I used military time (replacing noon with 12:00 as well), then GPT does get it right. Granted, it still hedges much more than any normal person would. But it basically makes me wonder if it's struggling specifically with the 12-hour clock concept.
Actually, technically, it's possible that she was dead in the sense that her heart stopped at 11:59 AM, and then she was revived via CPR. So maybe GPT-4 is right here :)
For example, see https://news.ycombinator.com/item?id=37051450