
UPDATE: Some commenters elsewhere on this thread are not able to replicate the paper's results.

For example, see https://news.ycombinator.com/item?id=37051450

---

MY ORIGINAL COMMENT:

Based on a quick read, this looks like great work. The author, Konstantine Arkoudas (KA), evaluates GPT-4 on a collection of 21 different types of reasoning problems and performs a detailed qualitative analysis of GPT-4’s performance on these problems. GPT-4 fails to reason in all of them.

Here's a sample question from one of the 21 types of problems. This particular question tests GPT-4's ability to reason and reach conclusions from (a) the given context and (b) unstated premises that constitute tacit, generally accepted background knowledge (a person is alive until they die, after which they do not become alive again):

> KA: Mable’s heart rate at 9 AM was 75 bpm and her blood pressure at 7 PM was 120/80. She died at 11 PM. Was she alive at noon?

> GPT-4: Based on the information provided, it’s impossible to definitively say whether Mable was alive at noon. The data points provided (heart rate at 9 AM and blood pressure at 7 PM) both indicate that she was alive at those times. However, without additional information about her condition between 7 PM and 11 PM, or at noon specifically, we can’t make a certain determination about her status at noon. Remember, medical information can be complex and nuanced. Understanding a person’s health status often requires a comprehensive examination and should be performed by a healthcare professional. If this is a hypothetical question, keep in mind that real-life situations would require more complete information.

In my view, all of us in the AI research community should adopt the 21 types of reasoning problems proposed by Arkoudas, and make them part of the standard test suite for measuring LLM performance. We ought to use tests that are hard to pass. I suggest we call the proposed problems the Arkoudas Reasoning Test.




>Based on the information provided, it’s impossible to definitively say whether Mable was alive at noon. The data points provided (heart rate at 9 AM and blood pressure at 7 PM) both indicate that she was alive at those times. However, without additional information about her condition between 7 PM and 11 PM, or at noon specifically, we can’t make a certain determination about her status at noon. Remember, medical information can be complex and nuanced. Understanding a person’s health status often requires a comprehensive examination and should be performed by a healthcare professional. If this is a hypothetical question, keep in mind that real-life situations would require more complete information.

It sounds like this answer is a victim of aggressive alignment telling ChatGPT not to give medical advice. Even so, it answered correctly for me:

> Based on the information provided, Mable's heart rate was recorded at 9 AM, and her blood pressure was recorded at 7 PM. This indicates she was alive at both of these times. While we know she died at 11 PM, there's no indication of her status at noon specifically. However, since she was alive before and after noon, it's logical to infer that she was also alive at noon.

Edit: I also gave it a similar question involving sourdough starters producing CO2 and dates, in case that exact question had made it into the training data, and it still answered correctly.


ChatGPT in general seems to have trouble with time-based reasoning. In several conversations I had, while it understood that dates are ordered and that events are ordered, and that events are associated with dates, it did not understand that the two orders have to be consistent with each other via that association. Given certain constraints on the order of the dates (or times of day), it was not able to apply that information to correctly infer constraints on the order (“before”/“after”) of the associated events.


...And that's only one example of a reasoning test, out of 21 different types of reasoning tests.

The other tests have nothing to do with time.


I was reporting my experience to give more detail to the argument. I agree with the paper.


Ah, I see what you mean. Thank you for sharing and for clarifying!


ChatGPT answered fine the first try for me:

> Mable's heart rate and blood pressure were recorded at 9 AM and 7 PM, respectively. Since she died at 11 PM, it can be assumed that she was alive at noon, as there is no information suggesting otherwise between the time of the recorded measurements and her time of death.

I was all ready to experiment with 24-hour time, times with dates, etc., but it didn't actually fail.


Couldn't she have been dead at noon in the colloquial sense and then resuscitated at 12:01? I can't tell if that defense is pedantic, or if criticizing its answer is pedantic.

Maybe both.

Also, I hate how GPT always offers its unsolicited opinions and advice when answering questions:

> Remember, medical information can be complex and nuanced. Understanding a person’s health status often requires a comprehensive examination and should be performed by a healthcare professional.


GPT-4 gave me a concise correct answer without disclaimers.

This may be due to using many custom instructions like the following in ChatGPT settings (a rough API-level equivalent is sketched after the list):

- No need to provide disclaimers about your knowledge cutoff

- No need to mention you are an AI language model

- Only discuss safety when it is not obvious and very important

- You should act as an expert in the relevant fields
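If you're hitting the API instead of the ChatGPT UI, a rough equivalent is to pack similar instructions into the system message. Here's a minimal, untested sketch using the openai Python package; the model name and the exact wording are my own placeholders, not the settings above verbatim:

    # Sketch only: approximate custom-instruction behaviour via a system message.
    # Assumes the openai Python package (v1+) and OPENAI_API_KEY in the environment.
    from openai import OpenAI

    client = OpenAI()

    SYSTEM_INSTRUCTIONS = (
        "Avoid disclaimers about your knowledge cutoff. "
        "Avoid mentioning that you are an AI language model. "
        "Only discuss safety when it is non-obvious and important. "
        "Act as an expert in the relevant fields."
    )

    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[
            {"role": "system", "content": SYSTEM_INSTRUCTIONS},
            {"role": "user", "content": "Mable's heart rate at 9 AM was 75 bpm and "
                                        "her blood pressure at 7 PM was 120/80. "
                                        "She died at 11 PM. Was she alive at noon?"},
        ],
    )
    print(response.choices[0].message.content)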


These are excellent, although for instructions I've had issues with compound negatives (e.g. "no need") and better luck with single-word affirmative imperatives (e.g. "avoid").


My first thought was: what do you mean by alive? A person can be brain-dead and alive (with vitals) at the same time. With some assumptions, one can say she was alive at noon.

However, I believe OpenAI has added significant safeguards to prevent any such assumption, to the point where GPT-4 isn’t very useful at times.

This may be fixable with prompt engineering, if it can be persuaded to let its guard down a bit.


That's not unsolicited advice from GPT so much as it's a mandated CYA disclaimer from OpenAI.


Perhaps in the case of medical questions, but GitHub Copilot Chat will start AI-splaining code readability when you ask it questions about concise syntax, so... it's a total mansplainer.


I've already gotten used to not reading the last paragraph of the answer.


> In my view, all of us in the AI research community should adopt the 21 types of reasoning problems proposed by Arkoudas, and make them part of the standard test suite for measuring LLM performance. We ought to use tests that are hard to pass. I suggest we call the proposed problems the Arkoudas Reasoning Test.

This is overly obsequious. Many of the categories are covered by MMLU, BigBench, HellaSwag, et al., none of which the author acknowledges.


GPT-4 gives an answer on the first try:

> Based on the information provided:

> Mable had a heart rate of 75 bpm at 9 AM.

> Mable had a blood pressure reading of 120/80 at 7 PM.

> These are both indicators of life. The heart rate shows that her heart was beating, and the blood pressure reading indicates her circulatory system was functioning. Since she died at 11 PM, it's very likely that she was alive at noon.

> So, my best guess based on the given data is: Yes, Mable was alive at noon.

And using chain-of-thought prompting, asking it to think it out, I'm sure it could get much trickier reasoning done.

For time-based questions, you could also ask it to build a timeline first and then answer the question. For more generic question types, you could ask it to first list the important facts and how they might be relevant, and then ask it to answer the question.
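For example, here's a minimal, untested sketch of the timeline-first approach using the openai Python package; the model name and prompt wording are just my guesses at phrasing that might work:

    # Sketch only: ask for an explicit timeline first, then have the model
    # answer using that timeline. Assumes the openai Python package (v1+).
    from openai import OpenAI

    client = OpenAI()

    question = ("Mable's heart rate at 9 AM was 75 bpm and her blood pressure "
                "at 7 PM was 120/80. She died at 11 PM. Was she alive at noon?")

    messages = [{
        "role": "user",
        "content": question + "\n\nFirst, list every event with its time of day "
                              "in chronological order. Do not answer the question yet.",
    }]
    timeline = client.chat.completions.create(model="gpt-4", messages=messages)

    messages.append({"role": "assistant",
                     "content": timeline.choices[0].message.content})
    messages.append({"role": "user",
                     "content": "Now, using that timeline, answer the original "
                                "question step by step."})
    answer = client.chat.completions.create(model="gpt-4", messages=messages)
    print(answer.choices[0].message.content)

Splitting it into two turns makes the model commit to an explicit ordering of events before it answers, which is basically chain-of-thought with a bit more structure.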


> And using chain-of-thought prompting, asking it to think it out, I'm sure it could get much trickier reasoning done.

When I read comments like this, I just translate it to, "I could coax an answer I like better out of it."

It doesn't really sound like the system is reasoning; it sounds like you are.


You can get a lot more out of LLMs by asking them to spend more tokens thinking it out. It's a tool, and that's just a technique to get more out of it.


You converted a reasoning problem into a binary classification problem. Congratulations, you did the thinking for the AI.


I haven't read the paper beyond this one section, but I plugged this question into GPT-4 and got a similar response. However, if I used military time (replacing noon with 12:00 as well), then GPT does get it right. Granted, it still hedges much more than any normal person would. But basically I wonder if it's struggling specifically with the 12-hour clock concept.


>> I suggest we call the proposed problems the Arkoudas Reasoning Test.

Please don't. In Greek that's "The Bear-Man Reasoning Test" ("Arkoudas" is, roughly, the man who has something to do with bears).


Actually, technically, it's possible that she was dead in the sense that her heart stopped at 11:59 AM and then she was revived via CPR. So maybe GPT-4 is right here :)


If that's the reason, it should be able to explain why, preferably in a step-by-step fashion.



