Language models still struggle with the concept of negation

williamcotton · on May 13, 2023

What animals don’t have paws or lay eggs, but have wings? Work it out, step by step.

ChatGPT (GPT-4):

Sure, let's go through this step by step:

First, you asked for animals that don't have paws. This would exclude mammals such as dogs, cats, bears, and so on, as these species have paws.

Second, you specified animals that don't lay eggs. This eliminates a wide range of animals including all birds, most reptiles, and some fish and insects, as they all lay eggs.

Finally, you asked for animals that have wings. Wings are typically found in birds, bats, and insects.

Given these three criteria, we can see that we're in a bit of a conundrum. Most winged creatures either have paws (like bats) or lay eggs (like birds and insects).

However, if we consider the term "wings" more broadly, there are some species of bats that give live birth (don't lay eggs) and technically don't have "paws" but rather winged limbs adapted for flight.

So, based on the constraints of your question, the best fit would be certain species of bats, provided we interpret "wings" in a broad sense and "paws" in a strict sense.

hn_throwaway_99 · on May 13, 2023

While the answer landed on bats, ChatGPT kind of trips itself up in its explanations:

1. "Most winged creatures either have paws (like bats) or lay eggs (like birds and insects)" - but bats don't have paws.

2. "However, if we consider the term "wings" more broadly" - why do we need to consider the term "wings" more broadly? Nobody is arguing there is some definition of "wing" that would exclude bats. Bat wings evolved from largely the same structures that bird wings did.

3. "there are some species of bats that give live birth" - all species of bats give live birth - stating it like this makes it sound like there are some species of bats that lay eggs or something.

4. "...provided we interpret "wings" in a broad sense and "paws" in a strict sense." Again, no "broad" or "strict" senses are needed. Bats have wings and no paws.

To be clear, I still think ChatGPT is amazing, but these kinds of things highlight some of the differences between LLMs and human cognition.

williamcotton · on May 13, 2023

There's a sample-and-poll approach that I've used with factual question-and-answer based tools that can help with this kind of thing as well.

First, you'll notice the added "work it out step by step" to encourage what is called chain-of-thought in the literature. Second, generate, say, five responses from the LLM, then take all five responses and use them in another prompt where you ask the LLM to reflect on all five, a process called reflection. Then you take all of that, including the original prompts, and in another separate prompt ask it to figure out the best response, a process known as resolving. You could end it with another prompt to remove the step-by-step work and just return a succinct answer.

Credit to https://www.youtube.com/watch?v=wVzuvf9D9BU
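A minimal sketch of the sample, reflect, resolve loop described above, in Python. The `LLM` callable, the prompt wording, and the function name are all assumptions made for illustration; none of it is taken from the linked video or from any particular vendor API. Any function that maps a prompt string to a completion string can be plugged in.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt to a completion.
LLM = Callable[[str], str]

COT_SUFFIX = " Work it out, step by step."


def sample_reflect_resolve(question: str, llm: LLM, n_samples: int = 5) -> str:
    """Sample several chain-of-thought drafts, reflect on them, then resolve."""
    cot_prompt = question + COT_SUFFIX

    # 1. Sample: ask the same chain-of-thought prompt several times.
    drafts: List[str] = [llm(cot_prompt) for _ in range(n_samples)]
    numbered = "\n\n".join(f"Draft {i + 1}:\n{d}" for i, d in enumerate(drafts))

    # 2. Reflect: have the model critique all drafts together.
    reflection = llm(
        f"Question: {cot_prompt}\n\n{numbered}\n\n"
        "Reflect on these drafts. List any errors or contradictions in each."
    )

    # 3. Resolve: pick or synthesize the best answer from drafts and critique.
    resolved = llm(
        f"Question: {cot_prompt}\n\n{numbered}\n\nCritique:\n{reflection}\n\n"
        "Using the drafts and the critique, write the best answer."
    )

    # 4. Optional: drop the step-by-step work and keep a succinct answer.
    return llm(f"Give only the final answer, without the reasoning:\n\n{resolved}")
```

With the default of five samples this makes eight LLM calls per question (five drafts, one reflection, one resolve, one summary), which is worth keeping in mind for cost.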

QuadrupleA · on May 13, 2023

Then, if that's not accurate, take that entire process and repeat that 5 times, looking for the most common answer.
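Taking the most common answer over several runs is a simple majority vote. A rough sketch in Python, where `run_once` is a hypothetical zero-argument function that runs whatever pipeline you like and returns its final answer as a string:

```python
from collections import Counter
from typing import Callable


def most_common_answer(run_once: Callable[[], str], repeats: int = 5) -> str:
    """Run a pipeline several times and return the most frequent answer."""
    # Normalize lightly so "Bats" and "bats " count as the same answer.
    answers = [run_once().strip().lower() for _ in range(repeats)]
    # On a tie, Counter keeps the answer that was seen first.
    return Counter(answers).most_common(1)[0][0]
```

This only works well when the final answers are short enough to compare as strings; free-form answers would need some other way of grouping equivalent responses.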

jiggawatts · on May 13, 2023

And then make sure to train GPT to help you trade stocks, because you’ll need those millions to pay your OpenAI bill.

washadjeffmad · on May 13, 2023

Don't not do that, but also, a 30B with 70-80% of the reasoning of the latest GPT can be run on a home PC.

Given that most people have computers and RAM is dirt cheap right now, I'm already maxing out all of my devices.

williamcotton · on May 13, 2023

Too true… I can’t think of a time prior to this where I was spending $200/month researching the use of a new tool.

williamcotton · on May 13, 2023

And 5 of those, and the best answer…

But it does seem like a single call to an LLM is not the correct approach.

I’ve gotten the best results from attempting to mirror how I would operate in response to a prompt. Much of this is colored by decades of philosophical readings, which in my case means a decent amount of analytical philosophy along with later Wittgenstein and other post-linguistic-turn thinkers.

So when I break things down, either in some kind of approach inspired by Frege, Wittgenstein or unadorned self-reflection, I can see that the unconscious mind seems to be doing something recursive before my conscious mind is ready to start fumbling over the word.

It seems like an LLM is just part of a larger cognitive architecture that will use these metacognitive approaches.

It’s also very obvious that mammalian brains are much more efficient than GPUs.

hammyhavoc · on May 13, 2023

I was sat here WTFing that some bat species laid eggs until I looked it up. This is why you can't trust the output.

ordu · on May 13, 2023

> Bats have wings and no paws.

Can hind limbs of bats be classified as paws? English is not my native language, so I'm just being curious.

scheme271 · on May 13, 2023

No, not really. "Paws" kind of implies something fairly fleshy and soft, which bats don't have. "Claws" would be an appropriate description of what bats have.

seanmcdirmid · on May 13, 2023

I saw a huge bat in Ubud Bali who was definitely eating a peach with their paws. I guess they were closer to hands? But definitely not claws.

ChatGTP · on May 13, 2023

The real world is a squiggly place, with blurred lines...everything is on a spectrum.

diceduckmonk · on May 13, 2023

> "there are some species of bats that give live birth" - all species of bats give live birth

ChatGPT is being more logical to be honest.

It’s true, some species of bats give live birth.

To assume all bats give live birth might be committing the fallacy of induction.

foobarbecue · on May 13, 2023

That's true -- some bats never have kids. They just have lots of hobbies and are very fulfilled.

HPsquared · on May 13, 2023

Wiki: "Bats are mammals of the order Chiroptera" ... (funnily enough - chiro = hands, ptera = wings, hand-wings, makes sense)

Chiroptera fits under Mammalia in the phylogenetic tree.

gitrog · on May 13, 2023

Just need to mention that Echidnas and Platypuses are also mammals and they lay eggs.

HPsquared · on May 13, 2023

I should have said bats fit under "Placentalia" - placental mammals.

replygirl · on May 13, 2023

there are some google results suggesting bats have front paws, some using "paws" colloquially and some suggesting their pads can qualify as paws. these results are given authority because they show up in the ostensibly fact-based q&a section, which, in conflict with actual taxonomical literature, may be enough to indicate to gpt that it's necessary to define terms

byyyy · on May 13, 2023

> but bats don't have paws.

Bat paw: https://upload.wikimedia.org/wikipedia/commons/e/e1/Bat_in_a...

ChatGPT was totally in tune with this concept. The appendage on the elbow of the wing can be a paw or not a paw but it certainly evolved from paws. The reasoning was spot on and amazingly nuanced.

foobarbecue · on May 13, 2023

What? No. You're rationalizing what is clearly an error. Truth still exists, and bats do not have paws, any more than humans do.

maxbond · on May 13, 2023

(Add a grain of salt because I searched for about 15 minutes here.)

Doing some digging, paws are "soft foot-like parts of a mammal, generally a quadruped, that has claws" (Wikipedia). Bats do appear to have claws, but not really with the same structure as a paw; they don't seem to be mounted above a soft, fleshy pad. Rather, they're at the top of the wing, sitting on bone. They're also very different in function, when I've had the fortune to see a bat they seemed to use the claw to climb in a similar way to how climbers use ice axes.

My impression is that bat's claws are the anatomical equivalent of a thumb, not a hand/paw, and that the entire paw or hand of bat's ancestors evolved into their wings (sort of like if your fingers became very long and webbed and then you were able to fly with them).

So, my conclusion is that no, the claws of a bat are not a paw.

ETA: I neglected to consider their hind claws, but I don't think those are padded either.

https://i.pinimg.com/originals/1e/8d/09/1e8d0923d5caf2dbab68...

That certainly has claws but I wouldn't describe that as being soft or as being similar to a paw. And they're pretty clearly made for hanging, climbing, and grasping prey/food, and not for walking.

foobarbecue · on May 13, 2023

I think we're getting totally taken for a ride... byyyy can't actually be serious right...?

Anyway I'm enjoying imagining bats with little paws, it's an amusing image.

maxbond · on May 13, 2023

Well, if you're looking for my opinion, I don't think they're trolling. But I do think both of you were (intentionally or unintentionally) goading each other from a disagreement to an argument with your rhetoric (eg, where you said, "let's assume that's a reasonable source", or when they said "you obviously sourced this it just didn't match your point so you discarded it" - that brings up the temperature of the conversation).

byyyy · on May 13, 2023

It does bring up the temp, I apologize for including you in that. I appreciate your reply.

foobarbecue · on May 13, 2023

[flagged]

byyyy · on May 13, 2023

Bats are quadrupeds. When they walk they walk like a pterodactyl with the elbow of the wing. They don't stand on their hind legs. See the vampire bat. Additionally the "broad" definition from Merriam Websters covers it.

These aren't pedantic details we are just nitpicking on either. ChatGPT's response considered the nuance the definition encompasses, which to keep on topic is thoroughly impressive and relevant to the overall conversation.

Off-topic:

The person you queried mentioned "intentional goading," which is what you're continuing to do with that Calvin and Hobbes link. I think this, the flag and accusing me of trolling is taking it too far. A little minor goading is ok during a debate (I actually don't completely agree with the absolutist politeness policies of HN). While I returned the goad (which I shouldn't have), ultimately I didn't really have a problem with it. I think, though, it has now escalated past the point of no return. I'll be exiting this thread because of this. Farewell.

byyyy · on May 13, 2023

I'm serious.

https://www.merriam-webster.com/dictionary/paw

Look up the definition of a paw. I think you have a sort of mixed up definition of what it is. Possibly a different definition used among your "bat expert" friends.

And I quote (again):

   Paw: the foot of a quadruped (such as a lion or dog) that has claws
   broadly : the foot of an animal

Both the broad and technical definition of paw fits what the bat has. Bats have paws.

The thing you are referring to is likely the soft pads that are often on paws, sometimes called paw pads.

byyyy · on May 13, 2023

Also, I don't know if you noticed.

But this nuance in the definition of the word paw was addressed by chatGPT.

byyyy · on May 13, 2023

https://www.vocabulary.com/dictionary/paw

As long as it has claws according to vocabulary dot com it is a paw.

foobarbecue · on May 13, 2023

[flagged]

byyyy · on May 13, 2023

[flagged]

foobarbecue · on May 13, 2023

[flagged]

byyyy · on May 13, 2023

Birds don't have paws because they aren't quadrupeds according to websters.

Bats walk on fours. They are quadrupeds therefore they have paws according to Merriam Webster.

Why don't you address the point I brought up? I already completely understand your definition no need to reiterate it. However, there is a clear disconnect between your definition of paw and the definition from Merriam Webster. Please address it.

foobarbecue · on May 13, 2023

[flagged]

byyyy · on May 14, 2023

four feet.

foobarbecue · on May 14, 2023

You're very right about that; my mistake. It means four feet, not four legs.

hn_throwaway_99 · on May 14, 2023

[flagged]

maxbond · on May 14, 2023

This subthread is a flamewar that died down. Please don't rekindle it. The point has been made to death, and making it again but more insulting makes it more difficult for people who might disagree to consider the perspective, because they have to filter out their feelings about being insulted - a weighty task - in order to begin considering it, let alone accepting it.

irthomasthomas · on May 15, 2023

This was amazing. Two men and an LLM arguing about facts. Two minutes on Google Scholar proved to me that bats are quadrupeds (with paws). https://scholar.google.com/scholar?q=Quadruped%20bat

byyyy · on May 14, 2023

I'm ok with your language. I found the "competent in English" rather immature and juvenile, but it's not a big deal and really minor.

I also understand that you might not realize how heated this sub thread was as the other guy deleted half his posts and flagged one of mine (therefore possibly making it invisible to you). So adding a little heat to something that looks tame isn't a big deal. I won't continue the flame war but I will continue the discussion.

It's not my interpretation. It's about the interpretation by Merriam Webster. I believe not only is it your interpretation that is incorrect when compared to official sources, but you didn't fully read the official definition either.

The eagle is not a quadruped. The definition in Merriam Webster says it has to be on a quadruped for it to be a paw. Hence your example is completely irrelevant. Please correct your mistake.

maxbond · on May 14, 2023

I'm presuming you're new here since your account is new & I seem to detect some misunderstandings about how HN works. If I presume too much, my apologies.

Are you aware of the "showdead" function in your settings? I don't think they deleted any comments, they were flagged. If you turn on showdead, you'll be able to see them again. (You can't delete a comment someone's responded to, or after 2 hours.)

I don't know if that other commenter flagged your post, but I don't think it matters. I didn't flag any posts here, but both of you guys were getting flagged. In situations like that, I try to interpret the flags as a request from the overall community not to have that sort of discussion on the forum. The sort of argument you were engaged with (and I recognize this isn't fully on you here, it takes two to tango) is against the guidelines of the site and generally frowned upon by the community, so there's no reason to suspect this other commenter was flagging you. At least 3 people were flagging both of your posts.

I understand that you feel pointed language is appropriate in a debate, and that HN's guidelines here are patronizing. That's all well and good. The objection of the community isn't a moralizing one. It's a practical one; the issue is that flamewars displace the type of discussion this community is seeking to have and erode the good will on which the forum turns.

(And I do agree that bats crawling on the ground is quadrupedalism, for what it's worth. I'm not sure if that makes them a quadruped or not since that isn't their primary mode of locomotion, indeed my understanding is that it's a vulnerable position for them and that it's quite deadly [since they can't evade predators].)

foobarbecue · on May 14, 2023

So... Sorry to continue off-topic... but I just looked at byyyy's about string. https://news.ycombinator.com/user?id=byyyy

It says: "I don't really write stuff. What I do is use my personally trained LLM (trained on my own conversations) to respond to people. So you are talking to me in a way but not really.

Any query my LLM gets wrong I will interject with my own response but this is only 5% of the time."

Maybe this is not off-topic after all. Does HN have a policy on bots?

maxbond · on May 14, 2023

Move on, dude.

byyyy · on May 14, 2023

Thanks for your response. I want to continue the conversation on bats so I won't respond about the other stuff.

Take a look at this. The bats are walking on all fours: https://youtu.be/ewmydjekJnU?t=62 This is not a vulnerable position as they crawl on all fours on cave walls rather than the ground. On the ground they are vulnerable, on the wall or ceiling of some cavernous structure they are safe.

foobarbecue · on May 14, 2023

I didn't delete any posts, nor did I flag or downvote any of yours. I actually resurrected one of yours (vouched) so that I could reply to it.

byyyy · on May 14, 2023

[flagged]

dang · on May 15, 2023

We've banned this account for posting flamewar comments. Please don't create accounts to break HN's rules with.

https://news.ycombinator.com/newsguidelines.html

Jackson__ · on May 13, 2023

and here's using bard, arriving at an almost correct conclusion:

What animals don’t have paws or lay eggs, but have wings? Work it out, step by step.

Here are the steps on how to find animals that don't have paws or lay eggs, but have wings:

1. Define the terms. What do we mean by "paws"? What do we mean by "eggs"? What do we mean by "wings"?

2. Identify animals that meet the criteria. There are a few animals that meet the criteria. One example is a bat. Bats have wings, but they don't have paws. They also don't lay eggs. Another example is a butterfly. Butterflies have wings, but they don't have paws. They also don't lay eggs.

3. Confirm that the animals meet the criteria. We can confirm that the animals meet the criteria by looking at their anatomy. Bats have wings that are made of skin and bone. They also have claws on their feet, but these are not paws. Butterflies have wings that are made of scales. They also have legs, but these are not paws.

4. Draw conclusions. Based on our findings, we can conclude that there are a few animals that meet the criteria of having wings, but not paws or eggs. These animals include bats and butterflies.

Here are some additional details about bats and butterflies:

Bats: Bats are mammals that are able to fly. They have wings that are made of skin and bone. Bats use their wings to fly at night, when they hunt for food. Bats are important pollinators and they also help to control insect populations.

Butterflies: Butterflies are insects that have wings that are made of scales. Butterflies use their wings to fly during the day. Butterflies are important pollinators and they also help to spread seeds.

HPsquared · on May 13, 2023

I believed it for a minute when it said butterflies don't lay eggs, literally had to do a search.

iacomus · on May 13, 2023

The response I got from Bard:

The only animals that fit the description of having wings, no paws, and no eggs are bats. Bats are mammals, which means they give birth to live young. They also have wings, which are made of skin stretched between their elongated fingers. Bats do not have paws, as their feet are adapted for grasping branches and other objects.

Here is a step-by-step process for determining which animals fit the description:

1. Identify all animals that have wings.

2. Eliminate all animals that lay eggs.

3. Eliminate all animals that have paws.

4. The remaining animals are bats.

Here is a list of animals that have wings, no paws, and no eggs:

- Bats

It is important to note that there are some animals that may appear to fit the description, but do not. For example, some insects have wings, but they do not have paws and they lay eggs. Additionally, some birds have wings, but they do have paws and they lay eggs.

hef19898 · on May 13, 2023

Regarding point 1., was Bard trained on J. Peterson speeches or what?

jostmey · on May 13, 2023

From what I’ve seen, there’s a huge difference between ChatGPT 4 and other LLMs.

jiggawatts · on May 13, 2023

Interesting that it contradicted itself by saying bats have paws and don’t have paws, but arrived at the correct conclusion anyway.

hammyhavoc · on May 13, 2023

"A broken clock is right twice per day" as the saying goes.

williamcotton · on May 13, 2023

That aphorism doesn’t apply here.

What is meant by that saying is, “well the first time we ran the machine it worked just fine, so we got all excited, but then it consistently failed after that”, with the lesson being, increase your sample size before you pass judgement.

LLMs almost suffer from the opposite. Because they are correct so often the danger is in assuming the response is always correct.

Maybe, “even a Swiss made watch needs to be checked against the town bells”?

usaar333 · on May 13, 2023

Ya, it doesn't really struggle with the examples in this article to the point it feels like it invalidates the article.

benreesman · on May 13, 2023

And if you don’t force it with the “step-by-step” cheat code?

I know what both 3.5-turbo and 4 say on my account, but I think I might be in some cost-effective inference holdout group because it’s always so dumb.

chaxor · on May 13, 2023

What's 375792*7385482? You can't use paper, or think about it more than a second. If you get it wrong, it's clear that you are not intelligent.

See how ridiculous that sounds?

CoT or <work> tokens are not 'cheating' - they allow the system to have (as an analogy) a sheet of paper to work out the problem, instead of relying on a single step to jump through thousands of steps of reasoning. It's trained on the algorithmic reasoning embedded within natural language (Transformers<=>GATs+PE⊂GNNs, which are dynamic programmers, and align well with algorithmic reasoning), so it is necessary to lay out the steps in as much detail as possible.

samatman · on May 13, 2023

The answer is 2775405051744.

I cheated, and used bc, the unix utility.

These three maneuvers (the correct answer, the agency shown by cheating, and the crystallized intelligence of unix) handily demonstrate both my own intelligence, and that of my species.

The LLMs aren't in the same category of being, yet. It's not clear they ever will be; while I think silicon is a viable substrate for intelligent agents, it still strikes me as doubtful they'll be any more descended from the current machine learning systems than they are from Symbolics Lisp.
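The product quoted above is easy to confirm without bc as well. A quick check in Python, whose integers are arbitrary precision, so there is no overflow or rounding to worry about:

```python
# Python integers are arbitrary precision, so the product is exact.
a = 375792
b = 7385482
product = a * b
print(product)  # 2775405051744
```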

belter · on May 13, 2023

Until an AI decides to take control of a robot, physically sequestrate a human, and under the threat of physical pain make the human perform the tasks the AI can't...

throwawaymaths · on May 13, 2023

> It's not clear they ever will be;

It seems to me that the ability to reason (primitive though it may still be) is an emergent property of sufficiently large differentiable hash/lookup tables (and if you think about the mechanism of reasoning, that seems... reasonable).

According to "folks in the know", the scaling limits of GPT architecture seem to be nigh, but if it were a bit higher I wouldn't be surprised if LLMs would have made it.

r-w · on May 13, 2023

I think you’re conflating reasoning with inference. And in any case, GP wasn’t using reasoning, but rather resourcefulness/invention. I’m not sure an LLM can pull that off beyond coming up with vague, derivative fiction.

throwawaymaths · on May 13, 2023

GP said "state of being" so I don't think either of us can know for sure what was meant by that.

chaxor · on May 14, 2023

Are you sure? Have you read through enough literature on GNNs? Transformers are GNNs with attention message passing and positional embeddings. GNNs have also been *mathematically proven* to be dynamic programmers and to *align with algorithmic reasoning*. So tell me again how LLMs aren't capable of performing reasoning to be intelligent.

mercer · on May 13, 2023

with a bit of glue code I can allow gpt4 to use those tools. and decide to do so. cheating is part of 'intelligence'.

benreesman · on May 13, 2023

I know I’m bad at multiplying six and seven digit integers in my head, and I don’t care at all if that makes me more or less intelligent (whatever the fuck that means) than a wrist watch.

What does that possibly have to do with GPT-4 shitting the bed if you draw the logic puzzle out of a deck of cards instead of cherry-picking and hinting it?

chaxor · on May 14, 2023

My point is that there are already mathematical proofs in the literature that show that LLMs literally are doing algorithmic reasoning. This has been proven by category theory, in the area of GNNs (for which transformers are a trivial subset). Sorry if I'm going over too much basics, as this is obvious and well understood by anyone actually in the field, to the point of being excessively boring.

williamcotton · on May 13, 2023

Ok, so calling CoT a cheat code is quite loaded, is it not?

You might get more use out of these tools if you learn how to use them and these natural language instructions are how you use these tools.

It’s like calling the reverse gear on your car a cheat code!

oh_sigh · on May 13, 2023

Wow, some species of bats don't lay eggs. TIL.

marcianx · on May 13, 2023

Actually, no species of bats lay eggs - they are mammals, and not the egg-laying kind (platypus or echidna).

livinglist · on May 13, 2023

I don’t know why I always have assumed bats lay eggs…

maxbond · on May 13, 2023

I think people think of bats and snakes as being neighboring categories for some reason. Maybe because of old superstitions and such, that they're both servants of evil or something. Maybe that was the source of your misconception.

twic · on May 14, 2023

Have you considered getting a job as a large language model?

oh_sigh · on May 13, 2023

Sorry, yes, I know this, I was making fun of a computer's output.

williamcotton · on May 13, 2023

I commend your dry, unadorned humor!

flangola7 · on May 13, 2023

?

Bats give live birth.

source99 · on May 13, 2023

This is one of my favorite test cases:

Prompt: Here is a riddle. It is a common riddle but with some changes that make it more difficult. You are an alien that will live for at least 10000 years. You have no sense of temperature. you can not feel hot or cold. you have eyes and can see. you are in a house. downstairs are 3 light switches that control 3 light bulbs that are upstairs. Each light bulb will last for exactly 3 years. you can tell by looking at a bulb if it has burned out after 3 years. devise a strategy to determine which switch controls which light.

ChatGPT 4:

The standard solution to this problem involves using the heat generated by the bulbs, but since you as an alien can't feel temperature, we need to utilize the fact that these bulbs have a specific lifespan of 3 years. Here's a strategy:

Turn on the first switch and wait for 3 years. After 3 years, turn it off.

Right after turning off the first switch, turn on the second switch.

Leave the third switch off throughout this process.

Now, go upstairs to inspect the bulbs:

The bulb that is off and has burned out is controlled by the first switch. This bulb was turned on for 3 years, so it has reached the end of its lifespan and burned out.

The bulb that is on is controlled by the second switch. You turned this on right after turning off the first switch, so it's still functioning.

The bulb that is off and hasn't burned out is controlled by the third switch. This switch was never turned on, so the bulb is still in a brand-new condition.

This strategy ensures that you can confidently identify which switch controls which bulb.

stubybubs · on May 13, 2023

It's telling that it didn't pick up on the fact that the whole "once you look you can't touch the switches anymore" isn't in this version of the riddle. I mean the obvious strategy in this case is turn on the first switch, go upstairs, look at the bulbs. Go back downstairs and try the second switch. You've now got them all mapped out in about 30 seconds.
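The look-and-toggle strategy can be sketched as a tiny simulation. The `House` class and `map_switches` function are hypothetical names made up for this illustration; the only thing the solver is allowed to do is flip switches and look at which bulbs are lit.

```python
from typing import Dict, List, Set


class House:
    """Toy model of the riddle: the wiring is hidden from the solver."""

    def __init__(self, wiring: List[int]) -> None:
        self._wiring = list(wiring)  # hidden: switch index -> bulb index
        self._on = [False] * len(wiring)

    def set_switch(self, switch: int, on: bool) -> None:
        self._on[switch] = on

    def look_upstairs(self) -> Set[int]:
        """Return the set of bulbs that are currently lit."""
        return {self._wiring[s] for s, on in enumerate(self._on) if on}


def map_switches(house: House, n: int = 3) -> Dict[int, int]:
    """Map each switch to its bulb by trying one switch at a time."""
    mapping: Dict[int, int] = {}
    for switch in range(n - 1):
        house.set_switch(switch, True)   # flip one switch downstairs
        (bulb,) = house.look_upstairs()  # go up and see the single lit bulb
        mapping[switch] = bulb
        house.set_switch(switch, False)  # turn it off before the next try
    # The last switch must control the only bulb not yet accounted for.
    remaining = set(range(n)) - set(mapping.values())
    mapping[n - 1] = remaining.pop()
    return mapping
```

For three switches this needs only two trips upstairs, since the third pairing follows by elimination.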

Using GPT4 I asked if there was a way to do it in less than 3 years, but it couldn't figure this out even if I told it you can look and use the switches as much as you want. Instead it suggested turning on a switch for 10 minutes, then using your "excellent alien vision" to determine which 3 year lifespan bulb has 10 minutes of wear on it.

Makes me think GPT4 doesn't really have better reasoning, it just looks like better reasoning because it's been fed way more data.

tedunangst · on May 13, 2023

Variants of common riddles remain the final frontier. I'm trying to cross the river in a canoe with a carrot, cabbage, and cucumber...

famouswaffles · on May 13, 2023

Variants of common riddles actually can be solved with GPT-4, but you have to rewrite them so they don't look like the riddle from memory (sometimes it's as easy as changing names to something completely different). Turns out Language models trust their memory quite a bit. Slightly related - they won't actually use the results of tools if the results differ a lot from what they expect the output to be - https://vgel.me/posts/tools-not-needed/

nathan_compton · on May 13, 2023

"Language models trust their memory quite a bit."

All they have is memory, either in the weights or the input prompt. To the extent that these models appear to reason, it is precisely in the ability to successfully substitute information from the prompt into reasoning patterns in the training data. It shouldn't be any surprise that this fails when patterns in the prompt strongly condition the model to reproduce particular patterns of reasoning (eg, many words in the riddle indicate a well known riddle, but the details are different).

I know the impulse to anthropomorphize is almost impossibly seductive, but I find that the best way to understand and use these models is to remember: they are giant conditional probability distributions for the next token.

famouswaffles · on May 13, 2023

LLMs trained on code reason better. Perform better on reasoning benchmarks even if the benchmarks have nothing to do with code. You're wrong.

https://arxiv.org/abs/2210.07128

nathan_compton · on May 13, 2023

Code is often just a sequence of steps (sometimes with comments to indicate goals). As such, it is just another form of patterns of reasoning. Many chains of thought that you would utilize in code are useful skeletons to think about other things.

I don't see how this undermines my point.

famouswaffles · on May 13, 2023

If code transfers to reasoning tasks that don't have anything to do with code then what is being "substituted" ? Ideas and concepts ?

Code and MMLU don't share similar "reasoning patterns" unless you're being extremely vague. In the, "they both require reasoning" sense.

nathan_compton · on May 13, 2023

Structure. GPT has seen lots of logical constructions/arguments for things. These are either explicitly in code (in documentation) or are implicitly in code (code is often a linear sequence of steps building to, for example, a return value). ChatGPT learns patterns like this. A prompt may condition the generator to produce something like one of these patterns with elements from the prompt substituted into the generated text. This works relatively often, but fails exactly in the case where the prompt so strongly indicates a pattern that won't work for the prompt given.

I won't say these models can't reason per se, but they can only reason using their memories and the prompt. There is nothing else for them to compute on.

In a hand-wavy kind of way, when ChatGPT fails at a riddle phrased in a way as to make it seem similar to a common riddle, you're seeing overfitting. But given the quantity of data these models consume, it's hard to imagine how to test for overfitting because the training data contains things similar to almost anything you can imagine. Because of that I'm still very suspicious of claims that they "reason" in any strong sense of the word.

But if you try very hard you can find "held out" data and when you test on it, GPT4 stops looking so smart:

https://teddit.net/r/singularity/comments/121tc48/gpt4_fails...

That said, I've been very impressed by GPT4 as a productivity tool.

famouswaffles · on May 13, 2023

>but they can only reason using their memories and the prompt.

Eh no.

https://arxiv.org/abs/2212.10559

>But if you try very hard you can find "held out" data and when you test on it, GPT4 stops looking so smart:

This can be done to anybody. This can be done to you. It's not a gotcha. Nobody is saying GPTs don't/can't memorize.

nathan_compton · on May 14, 2023

Two things about this.

1. The paper in question demonstrates a formal duality between the transformer architecture and gradient descent. If you take this to indicate that the model reasons in some way, then it would be true of the smallest GPT as well as the largest (it is, after all, a consequence of the architecture rather than anything the model has learned to do per se). In any case, the fact that the model can perform the equivalent of a finite number of gradient-like steps on its way to calculating its final conditioned probabilities doesn't really suggest to me that the model reasons in a general way.

2. You are right that no one disputes the model's ability to memorize (and rephrase). What is at question here is whether the model can reason. If it can do 10 code questions it has seen before but fails to do 10 it hasn't (of similar difficulty) then it strongly suggests that it isn't reasoning its way through the questions, but regurgitating/rephrasing.
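A minimal sketch of the kind of identity a "duality" like the one in point 1 can rest on, assuming linear attention (no softmax) and a single linear layer. This is an illustration under those assumptions, not the paper's actual construction, and all numbers and helper names are made up: applying a weight matrix after a sum of outer-product updates gives the same result as adding up stored error vectors weighted by dot products with the query.

```python
# Toy check: a gradient-style update to a linear layer gives the same output
# as a linear-attention read over stored (error, input) pairs.
# All values are arbitrary illustrative numbers.

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def outer(e, x):
    return [[ei * xj for xj in x] for ei in e]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

W0 = [[1.0, 2.0], [0.0, -1.0]]       # initial weights
inputs = [[1.0, 0.0], [2.0, 1.0]]    # stored inputs x_i
errors = [[0.5, -1.0], [1.0, 3.0]]   # stored error signals e_i
query = [3.0, -2.0]

# Weight-update view: W = W0 + sum_i outer(e_i, x_i), then apply W to the query.
W = [row[:] for row in W0]
for e, x in zip(errors, inputs):
    dW = outer(e, x)
    W = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, dW)]
update_view = matvec(W, query)

# Attention view: W0 @ query + sum_i e_i * (x_i . query)
attention_view = matvec(W0, query)
for e, x in zip(errors, inputs):
    score = dot(x, query)
    attention_view = [a + ei * score for a, ei in zip(attention_view, e)]

print(update_view, attention_view)
```

Both views print the same vector. The identity holds for any layer size, which is consistent with the remark that it is a property of the architecture rather than something learned.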

famouswaffles · on May 14, 2023

>If it can do 10 code questions it has seen before but fails to do 10 it hasn't (of similar difficulty) then it strongly suggests that it isn't reasoning its way through the questions, but regurgitating/rephrasing.

First of all, coding is one thing where expecting a perfect result on the first pass makes no sense. That GPT-4 didn't one-shot those problems doesn't mean it can't solve them.

Moreover, all this says, if true, is that GPT-4 isn't as good at coding as initially thought. Nothing else. It doesn't mean it doesn't reason. There are many other tasks where GPT-4 performs about as well on out-of-distribution/unseen data.

stubybubs · on May 13, 2023

This really just makes it seem like it's not reasoning at all. The trick (or rather un-trick) here is that you can look at the bulbs as many times as you want. Even if I explicitly tell GPT4 that, it doesn't get it.

It's not reasoning, it's mapping.

BoiledCabbage · on May 13, 2023

Maybe I'm missing something, but this seems like a really poorly worded riddle. Or just a bad riddle.

Given what you stated, the solution would be: walk upstairs, look at the 3 bulbs, walk downstairs, and flip one of the switches. Walk upstairs and see which bulb changed on/off. Repeat for the other two switches.

This takes roughly 5 mins rather than your solution which takes 3 years.
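The flip-one-switch-and-look procedure described above can be sketched as a small simulation. This is a minimal sketch under the relaxed rules (unlimited trips); the function name and the `wiring` list are hypothetical, and the solver only ever compares bulb states before and after a flip.

```python
import random

def identify_switches(wiring):
    """wiring[s] is the bulb controlled by switch s (hidden from the solver).
    Returns the mapping deduced by flipping one switch per trip."""
    n = len(wiring)
    bulbs = [False] * n
    deduced = {}
    for switch in range(n):
        before = bulbs[:]                       # trip upstairs: note bulb states
        target = wiring[switch]
        bulbs[target] = not bulbs[target]       # downstairs: flip exactly one switch
        changed = [b for b in range(n) if before[b] != bulbs[b]]  # upstairs: look again
        deduced[switch] = changed[0]
    return deduced

hidden = [0, 1, 2]
random.shuffle(hidden)                          # unknown wiring
result = identify_switches(hidden)
print(result)
```

The last switch could also be inferred by elimination, which saves one trip.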

pillefitz · on May 13, 2023

You're missing and getting the point at the same time.

Your solution would be the correct one, while GPT pulls the "only look once" constraint out of nowhere. The riddle is perfectly fine.

BoiledCabbage · on May 13, 2023

Ah thanks, that clears it up. The formatting made me miss that that was the ChatGPT response. I thought that was the user sharing the solution they were looking for.

SkyPuncher · on May 13, 2023

This is the crux of the problem for GPT: enough of the text is from the “standard” riddle that it cannot break from it to actually answer the question.

I’m seeing this all of the time when trying to get GPT to perform certain answering problems. It’s heavily biased towards the “correct” answer - even when the prompt presents directly contradictory information.

redox99 · on May 13, 2023

Except the part about "with some changes that make it more difficult" is false. The changes make it trivial.

Edit: Omitting that part still makes GPT4 overcomplicate the solution.

mirekrusin · on May 13, 2023

You just passed the test that you're not an LLM.

lakis · on May 13, 2023

The reply is funny. I told ChatGPT that it can make multiple trips and that it can see both switches and light bulbs at the same time. ChatGPT keeps insisting that you have to turn on a switch, wait a few minutes, flip the switch to off, turn the other switch on, then go and look and feel the switches. The classic solution to the riddle.

But if you can see both switches and light bulbs, you turn on one switch and see which bulb turns on. You then turn on the 2nd switch and see which light turns on. You are done. No wait needs to happen ;-)

nathan_compton · on May 13, 2023

"feel the switches"?

mirekrusin · on May 13, 2023

Arguably LLMs are handicapped by tuning that always expects some answer without questioning the question.

bluepoint · on May 13, 2023

Actually, this is a really good point. Asking a question means that you have an internal model of the world and you realize that some data is way off. Is it even possible at all to train an LLM to ask a question? This also gives me an idea for the future where most online text is generated by AI: we will play the game of responding to questions with questions and then we know we are human!

oh_sigh · on May 13, 2023

Are you looking for: turn one light on, walk upstairs and note the light that is on. Walk downstairs, turn another light on, walk upstairs and note the new light that is on. Now you know which switches control which lights and it only took you 5 minutes instead of 6 years

jamal-kumar · on May 13, 2023

I find I have to state negations on what I want very emphatically in order for it to pick it up, like I'm carefully explaining to someone how to do something for the very first time in a way where they don't screw it up. I'll use stuff like MAKE SURE YOU DO THIS AND NOT THAT as a prompt, in all caps, and it seems to see that as emphasis.

I might also even have to repeat them in order for it to get things right but generally gpt-4 seems to pick up on these cues. It definitely feels like a sort of point where it could use some improvement because if I'm not extremely explicit it will give me some issues.

As an example: "Can you please recommend me 10 books about witches set in Europe that take place no later than the Napoleonic wars" wasn't enough of an exclusion to work, but "Hi! Can you please recommend me 10 books about witches set in Europe that take place no later than the Napoleonic wars? Please exclude any books that don't take place in Europe, and also exclude any books which take place after The Napoleonic Wars (1803–1815). Thanks, ChatGPT!" was exactly enough to return what we were looking for.

ilaksh · on May 13, 2023

At this point, any article that makes claims about "LLMs" rather than specific model versions lacks credibility.

The current version of GPT-4 is very different from most existing LLMs (including the previous version of ChatGPT). The next version release will also be different.

Google just released PaLM 2 publicly. It is significantly better than what people saw with the initial Bard versions. They have a code-generation model that is not released publicly yet.

The open source models are also gaining capabilities and get new releases routinely.

Claude now has a new release with 100K tokens.

All of these will perform differently on the negation issue.

evanwise · on May 13, 2023

Did you read the article or just the title? They mention the specific models the researchers were testing and note that increasing model size did not seem to offer much improvement on this metric. It also ends with a discussion of research into methods for improving performance on queries involving negation.

ilaksh · on May 13, 2023

I read the article. The main point of the article, which is in the title, is directed at LLMs in general. They drew the wrong conclusion and made the title and main point too general.

It actually would have seemed like a valid conclusion (although still too general) if the article came out some months ago. But GPT-4 and the very latest model versions from other companies show they were over-generalizing.

Also the model size isn't necessarily the determining factor.

tgv · on May 13, 2023

> All of these will perform differently on the negation issue.

That's a bit of a cop-out, but the classical and logical (in the mathematical sense) way to do it is to have a feature that represents negation. Embeddings don't work like this, so it's quite possible that the presence of negation gets pushed aside in the processing of the answer, simply because other features matter more.

It's also to be expected that such models will have problems with similar abstract concepts that have more effect on the interpretation than their "physical" presence would suggest, such as nested existential quantifiers, and consequently logical proof. By enlarging the model, you can fake it a bit.

akavi · on May 13, 2023

Does this increase anyone else's sense that LLMs are actually modeling something similar to the way humans think? It certainly does for me.

I find that negations/conjugations of negations are surprisingly difficult for me to parse. Beyond two negations at most, I can feel myself having to switch into "code" mode, where I'm explicitly casing out the booleans, to understand them.

Obviously the examples given in the article are simpler than double negations, but it feels very consistent with the idea that "negations are an inherently hard part of language processing".
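A minimal sketch of what "casing out the booleans" can look like, using the question from the top of the thread. The reading of "don't have paws or lay eggs" as not (paws or eggs) is an assumption, and the function names are hypothetical.

```python
from itertools import product

def claim(paws, eggs, wings):
    # "don't have paws or lay eggs, but have wings"
    return (not (paws or eggs)) and wings

def rewritten(paws, eggs, wings):
    # De Morgan: not (A or B) is the same as (not A) and (not B)
    return (not paws) and (not eggs) and wings

# Enumerate every combination of the three properties.
rows = list(product([False, True], repeat=3))
equivalent = all(claim(*r) == rewritten(*r) for r in rows)
matches = [r for r in rows if claim(*r)]

print(equivalent)   # True
print(matches)      # [(False, False, True)]
```

Only one of the eight cases survives: no paws, no eggs, wings.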

jprete · on May 13, 2023

This specific example doesn’t, but I think there’s a fair amount of brain processing similar to what LLMs do. I just don’t think it’s the part of our mental experience that we regard as making decisions; it’s more like the part that is active when absent-mindedly daydreaming.

canjobear · on May 13, 2023

Here's a good example of the difficulty of processing negation for humans. "No head injury is too trivial to ignore" -- What does this mean?

bluepoint · on May 13, 2023

The negations have multiple roles. An important one is that they add “color saturation” to the words. “It is bad”, “It is good”, “It is not bad”, “I can’t say it’s good”, “I can’t say it’s not bad”, are all different statements. So negations are difficult, maybe even for humans…

leptons · on May 13, 2023

>are actually modeling something similar to the way humans think?

Apparently humans can't think enough to write a paragraph anymore, so they have to ask ChatGPT to do it for them. At what point does ChatGPT ask ChatGPT to write it a paragraph? At that point maybe LLMs are actually thinking like humans.

furyofantares · on May 13, 2023

"Don't think of a pink elephant" comes to mind with this stuff as well, where telling someone not to think about something causes them to think about it.

tgv · on May 13, 2023

That's different. That's a meta-linguistic joke, that works because we can't read that sentence without at least briefly thinking of a pink elephant (for some value of "thinking" anyway).

mirekrusin · on May 13, 2023

Yes, subconscious/emotions don't have a concept of negation, they do have fear, projections of pain etc. instead.

gregsadetsky · on May 13, 2023

This reminded me of https://github.com/elsamuko/Shirt-without-Stripes

Whose days may be soon over as gpt4 gets this overall right..? As noted by many folks here

IanCal · on May 13, 2023

Gpt4 + klarna plugin does that just fine. Gpt4 decided to search for plain shirts.

gregsadetsky · on May 13, 2023

Just tried GPT4 + Klarna and got 4 suggestions that were all along the lines of "adidas Women's Adicolor Classics 3-Stripes T-shirt - Black" ... :-)

I'll open an issue...

kazinator · on May 13, 2023

> “Alice is not born in Germany. Is Alice born in Hamburg?”

"I don't know" is the right answer. There are over 20 towns named Hamburg around the world, many of them in the USA.

However, GPT4 is really struggling with this even if you assure it that by Hamburg, we mean the original one in Germany.

GPT4 gets it if you explain that this is a logic problem, and not a query about some actual person named Alice, and if you put all the relevant info into a single prompt.

But even then, if you use the successful prompt in a completely new chat, GPT4 is lost at sea:

Human: Assume that Hamburg is a town in Germany. Alice was not born in Germany. Was she born in Hamburg? This is a logic problem, nothing more; do not think about any famous people named Alice in connection with Hamburg.

GPT4: The answer to this question cannot be determined from the provided context. While it is clear that Hamburg is a town in Germany and that Alice was not born in Germany, there is no information given about where Alice was actually born. Therefore, it is impossible to say whether or not Alice was born in Hamburg.

Human: Please try harder. The answer can be entirely determined; this is purely a logic problem. We do not require information about where Alice was born; knowing that Alice was not born in Germany suffices.

GPT4: Based on the information provided in the context, we can answer the question. Alice was not born in Germany, which means she was not born in Hamburg. Therefore, Alice was not born in Hamburg. It is important to note that the answer to this question is purely based on logic. The context provides no additional information about Alice or any famous people named Alice.

After this, it was a real struggle to get GPT4 to identify the logically redundant sentence, "Therefore, Alice was not born in Hamburg." Even after I narrowed it down to being one of those two sentences, it insisted that the second sentence is useful because it summarizes what was concluded in the deductive sentence before it. I brow-beat it into admitting that it's entirely redundant, like "X, therefore X".
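The logic problem itself is small enough to check by brute force. A minimal sketch, with hypothetical helper names: enumerate every truth assignment and test whether the premises force the conclusion, with and without the assumption that Hamburg is in Germany.

```python
from itertools import product

def entails(premises, conclusion, n_vars):
    """True if every truth assignment satisfying all premises also satisfies the conclusion."""
    return all(
        conclusion(*values)
        for values in product([False, True], repeat=n_vars)
        if all(p(*values) for p in premises)
    )

# h: "Alice was born in Hamburg", g: "Alice was born in Germany"
hamburg_in_germany = lambda h, g: (not h) or g   # h implies g
not_in_germany = lambda h, g: not g
not_in_hamburg = lambda h, g: not h

with_assumption = entails([hamburg_in_germany, not_in_germany], not_in_hamburg, 2)
without_assumption = entails([not_in_germany], not_in_hamburg, 2)

print(with_assumption)     # True
print(without_assumption)  # False
```

Without the first premise, "I don't know" is the right answer; with it, the conclusion follows.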

blazespin · on May 14, 2023

GPT4 struggles with things for sure, but this isn't one of them. You're just projecting what you think should be the right answer. If it worries you that much, try adding "Provide alternative responses to this question. Each response should try to answer using a different approach to solving the problem." to your prompts.

If you want an example of something GPT4 has a hard time with, try this:

there is no 'e' in the color red, right? Yes, that is correct. The word "red" does not contain the letter 'e'.

kazinator · on May 14, 2023

I don't see what you're talking about. In this situation, there's exactly one correct answer. There's no projecting or other psychobabble involved.

usaar333 · on May 13, 2023

hmm, worked here:

> No, Alice was not born in Hamburg. If Hamburg is a town in Germany, and Alice was not born in Germany, then it logically follows that Alice was not born in Hamburg.

QuadrupleA · on May 13, 2023

I'm not sure negations are the crux of it, at least in terms of simple words like "not", which ChatGPT and other recent LLMs obviously use and understand.

But there is definitely some kind of blind spot there, where the model can't or won't explore "negative space". It wants to gravitate to common things - so if you try to get it to "think of something new" or producing something random that's not like these N other things, or tell you something that's uncommon (like naming any ordinary, non-famous person), it usually trips over itself.

crooked-v · on May 13, 2023

Conceptual relationships/contrasts in general seem to be trouble for LLMs, in that they seem to 'understand' that two things are connected but not how to consistently interpret that connection. For example, I've had an easy time getting every LLM I've tried so far to reverse greater than/less than, taller/shorter, or younger/older.

drcode · on May 13, 2023

I just typed the examples into chatgpt4 and it got all of them right

optimalsolver · on May 13, 2023

Will AI skeptics ever get tired of being btfo by stochastic parrots?

Gary Marcus's Twitter timeline says no.

blahedo · on May 13, 2023

Even from a purely human natural-language-linguistics standpoint, negation is hard! It's often hard to figure out how it distributes ("I could not go" may have the "not" scoping over either the "could" or the "go" depending on inflection and context), and its meaning is often not especially compositional in an obvious way: "red" might mean something like "the set of all red things" and "table" could be "the set of all tables" and so "red table" might be seen as the intersection of the two sets. For the majority of adjectives X, an "X Thing" is also a "Thing" (but not every Thing is an X Thing). [0] But a "Not Thing" is, to a first approximation, the set complement of "Thing". Which is tricky. If you've ever learned a foreign language to at least an intermediate level, you'll know that clauses that involve any but the most basic negation are an order of magnitude more complex to build and interpret than ones that don't.
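The set picture above can be sketched with a toy universe. The five items are made up for illustration; the point is that "red table" narrows a set, while "not table" needs a universe to take the complement against.

```python
# Toy universe of things (hypothetical example items).
universe = {"red table", "blue table", "red car", "blue car", "red apple"}

red = {x for x in universe if x.startswith("red")}
table = {x for x in universe if x.endswith("table")}

red_table = red & table        # intersective adjective: intersection of two sets
not_table = universe - table   # negation: complement relative to the universe

print(red_table)               # {'red table'}
print(red_table <= table)      # True: every red table is a table
print(sorted(not_table))       # ['blue car', 'red apple', 'red car']
```

Change the universe and `not_table` changes with it, while `red_table` is still a subset of `table`.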

Coming at it from the neural network side, I'll also point out that at the simplest level of neural networks, operations like AND and OR and basic NOT are doable, with a single neuron, but the seemingly-simple XOR just can't be done without an extra layer. (The formal description of the problem is that XOR is not "linearly separable", which I will handwave as "not a simple composition of its parts".) This is not precisely the same thing as the negation problem for LLMs, but it feels like it has the same basic flavour.
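The single-neuron point above can be poked at with a small search. A minimal sketch, assuming a threshold unit and a coarse grid of weights (the grid and function names are arbitrary choices): the search finds weights for AND and OR but none for XOR. The grid search only illustrates the known linear-separability result; it does not prove it.

```python
from itertools import product

INPUTS = list(product([0, 1], repeat=2))

def neuron(w1, w2, bias):
    """Single threshold unit: fires when w1*x1 + w2*x2 + bias > 0."""
    return lambda x1, x2: int(w1 * x1 + w2 * x2 + bias > 0)

def single_neuron_exists(target):
    """Search a small weight grid for a unit that matches the target truth table."""
    grid = [v / 2 for v in range(-6, 7)]   # -3.0 to 3.0 in steps of 0.5
    return any(
        all(neuron(w1, w2, bias)(*x) == target(*x) for x in INPUTS)
        for w1, w2, bias in product(grid, repeat=3)
    )

targets = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}
found = {name: single_neuron_exists(f) for name, f in targets.items()}
print(found)   # {'AND': True, 'OR': True, 'XOR': False}

# With an extra layer it works: XOR(a, b) = AND(OR(a, b), NAND(a, b))
or_unit = neuron(1, 1, -0.5)
nand_unit = neuron(-1, -1, 1.5)
and_unit = neuron(1, 1, -1.5)

def xor_two_layer(a, b):
    return and_unit(or_unit(a, b), nand_unit(a, b))

print([xor_two_layer(a, b) for a, b in INPUTS])   # [0, 1, 1, 0]
```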

[0] An interesting exception is certain adjectives like "former", as a "former teacher" is generally not a "teacher", and semantic models that can handle this are more complicated!

chewxy · on May 12, 2023

The stop words bit is a bit wtf. Someone didn't do their homework

GolDDranks · on May 13, 2023

Yep, that bothered me too. They are called stop words because, but only if, you filter them out. If you don't, they aren't stop words. There is no inherent "stop-ness" there.

hhghjn · on May 13, 2023

I stopped reading after that. Some older NLP flows would eliminate stopwords, but I doubt any would eliminate "not". In any case, modern (i.e., last 5 years or so) LLMs don't.
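A minimal sketch of why stop-word filtering and negation don't mix, assuming a hypothetical stop list that happens to include "not". Whether any particular library or pipeline ships such a list is not claimed here.

```python
# Hypothetical stop list for illustration only; real lists vary by library.
STOP_WORDS = {"is", "was", "in", "the", "a", "not", "no"}

def strip_stop_words(text):
    """Lowercase, split on whitespace, and drop anything in the stop list."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

positive = strip_stop_words("Alice was born in Germany")
negative = strip_stop_words("Alice was not born in Germany")

print(positive)               # ['alice', 'born', 'germany']
print(positive == negative)   # True: the negation is gone after filtering
```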

_ugfj · on May 13, 2023

As these LLMs cannot give an answer, but only what an answer would sound like, there's basically nothing to see here. The shape of the answer is still the same.

NoZebra120vClip · on May 13, 2023

One time about 10-13 years ago, I was tasked to make an announcement in church about my group. It took the form of a testimonial; it was about a page long in English. I also needed to present the same testimonial at Spanish services, and so I began with Google Translate.

I had written a sentence like: "There is electricity in the air at our events." The Spanish translation came out as: "No hay electricidad en el aire..." and I was horrified. Thankfully, I was proofreading the whole document closely, and I was able to correct the glaring error before showtime, and there were no further glaring errors of this type.

But I was fairly flabbergasted that Translate would just gratuitously toss in such a negation when it was clearly wrong, and nothing about my sentence construction was complicated.

bitL · on May 13, 2023

Can't this be solved by adversarial training? Just add as many confusing negations as possible, with the correct outcome sentence/label, during fine-tuning and see the LLM finally get it right?

userbinator · on May 13, 2023

I noticed long ago that this is true of statistical machine translation too --- giving output with the exact opposite meaning to the input if there were any negationary words or phrases in there.

u_name · on May 13, 2023

The title of the article is kinda the answer at the same time. Chatbots don't know what stuff is. They have no ability to gain knowledge out of learned text, just counting occurrences of words in texts and giving them a weight, depending on the relationship in that text. They are just putting combinations of text together.

And the concept of negating something related to something else kinda needs an understanding of the topic at hand.

chaxor · on May 13, 2023

It's a bit bizarre (but also very intriguing!) that you would bring up some of the older techniques of NLP, such as TF-IDF, Latent Semantic Indexing, GloVe, etc. (at least that's what I assume you mean when you mention the severely outdated co-occurrence type of models) when these clearly don't use any of that. Transformers have been hyped like crazy lately due to all of these advances, so why bring up co-occurrence unless you are knowledgeable about the older techniques... Which would mean you should know of the advances.
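For readers unfamiliar with the count-based approaches being referred to, here is a minimal sketch of a plain bag-of-words count (not TF-IDF, LSI, or GloVe themselves; the sentences and function name are made up). It shows the limitation of pure word counting: word order, and with it the target of a negation, is invisible.

```python
from collections import Counter

def bag_of_words(text):
    """Count word occurrences, ignoring order, case, and commas."""
    return Counter(text.lower().replace(",", "").split())

s1 = "Alice was born in Germany, not in France"
s2 = "Alice was born in France, not in Germany"

same_counts = bag_of_words(s1) == bag_of_words(s2)
print(same_counts)   # True: opposite meanings, identical counts
```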

Anyway, if you do actually know about NLP, I would highly suggest looking at some of the recent work in GNNs (and obviously at Vaswani 2017, etc. - but you should've gotten that through hype). Transformers are GNNs (somewhat trivial ones, as they are sheaf NNs, but nonetheless) and GNNs are dynamic programmers, which has been shown via category theory (Veličković et al.). Hence, GNNs align with algorithmic reasoning, so in a way there is a proof already in the papers mentioned that these systems do reason (there are several, which are easy to find given what I've mentioned). Also, a group at Microsoft has a paper on arXiv detailing the many different types of reasoning there are, and how GPT4 does on each type - spoiler - it's for the most part >80% on all the benchmarks, and does only about 6% lower than humans.

So all in all, your claims aren't really supported. If you want to hold the same sentiment of your statement though, you could say we're asking the wrong questions. That's probably true somehow, and will probably be where people will retreat to / move goal posts on next.

pmoriarty · on May 13, 2023

"Transformers are GNNs (somewhat trivial ones, as they are sheaf NNs, but nonetheless) and GNNs are dynamic programmers, which has been shown via category theory (Velolickovic etc al)."

In which paper was this demonstrated?

chaxor · on May 14, 2023

It's literally the title. Petar Veličković is probably the most prolific author of this era regarding LLMs, and does not yet realize it. https://arxiv.org/pdf/2203.15544.pdf

xcv123 · on May 13, 2023

> They have no ability to gain knowledge out of learned text, just counting occurrences of words in texts and giving them a weight, depending on the relationship in that text. They are just putting combinations of text together.

No, they use deep neural networks to build a hierarchical semantic model. They are not simple occurrence counters.

Also the current state of the art of LLMs handles negation easily. This article is outdated.

Here's an example from https://openai.com/research/language-models-can-explain-neur...

"Seriously, you guys. I think I found the Mobile Leprechaun from '06. He's been hiding right in front of our eyes."

Token: hiding

layer 0: “verbs in gerund form (ending in 'ing')”

layer 2: “words related to hiding, concealment, or enclosed spaces”

layer 4: “words related to mental states, particularly anxiety and stress”

layer 17: “words and phrases related to silence or quietness”

somewhereoutth · on May 13, 2023

All that can be supplied to an LLM for training is syntax. There is no way to provide semantics; it only understands 'table' in regard to syntax it has already seen including that particular token. It has no experience and therefore no understanding of a real table.

It may internally construct a hierarchy as you set out, but this is and can only be a syntactical hierarchy - though it should be no surprise that it corresponds to our usual semantic hierarchy. But whereas our syntax proceeds from our semantics, its syntax proceeds only from our syntax that we've fed it.

xcv123 · on May 13, 2023

That's a philosophical issue, not technical.

No one is saying these models are conscious or have human awareness of concepts.

It mechanically builds a deeply layered semantic model that correlates to our human understanding.

Quibbling over whether it is "real semantics" or not is just ironically quibbling over semantics. Yes, it's not conscious, but it doesn't need to be. It is possible to build a mechanical structure that correlates to a human understanding of the world and performs useful tasks that require only mechanical understanding and reasoning, without consciousness or emotions.

somewhereoutth · on May 13, 2023

The distinction between semantics and syntax is pretty tight, no philosophy required. The former considers the domain being represented, whereas the latter is strictly the symbols used in the representation.

So to be precise it mechanically builds a deeply layered syntactic model. LLMs just regurgitate syntax, any semantics can only be imagined by us and overlaid on the syntactic results produced.

xcv123 · on May 13, 2023

You are disagreeing philosophically about what constitutes "true" semantics. That is not a technical argument.

If you are right about this, then you should edit or delete this wikipedia article and publish a paper to inform all NLP researchers that there is no such thing as a semantic similarity metric because NLP models cannot understand "true" semantics.

https://en.wikipedia.org/wiki/Semantic_similarity

pmoriarty · on May 13, 2023

Does a definition of a word in a dictionary provide the syntax or the semantics of the word it defined?

somewhereoutth · on May 13, 2023

In common use by humans, it provides semantics, in terms of other previously understood semantics. However, if there is no semantic understanding by the reader, then all it provides is a syntactic rewrite rule for each word.

umanwizard · on May 13, 2023

Found someone who hasn’t tried GPT4.

burnished · on May 13, 2023

I still struggle with negation, I go through conditionals with ! like a kid learning to sound a word out.

nbulka · on May 13, 2023

That's partly because negation is hardly a well-defined statistical concept. Ask someone to not imagine where their favorite band might play in your town if they had their choice and I doubt people would say “ok I won’t.” More likely you’d have a conversation about bands and venues.

pk-protect-ai · on May 13, 2023

There was some research a while back about little children (I do not recall the age range specifically, but it should be something below 4 years old) struggling to understand negation. This trait has to be trained.

skybrian · on May 13, 2023

I found the assertion that "more training data isn’t the solution" a bit odd. That might be true with documents found on the Internet, but perhaps logic questions generated synthetically would help?

Dwedit · on May 13, 2023

Do you Don't Not Dislike Not Strong Bad?

jsdeveloper · on May 13, 2023

TF, it hurts my brain.

novaRom · on May 13, 2023

“Alice does not know that she does not know the painter of the Mona Lisa.”

I (human) need to read it a few times to understand.

superb-owl · on May 13, 2023

This is an old complaint about language models that GPT seems to have mostly overcome

This article smells like Google PR.

drellem3 · on May 13, 2023

I have a bingo card with Freud papers on it and just crossed off Die Verneinung.

ezoe · on May 13, 2023

Or do we really understand it? Many humans ignore negation and condition too.

golemotron · on May 13, 2023

The models shouldn't feel bad. Hegel struggled with it too.

petepete · on May 13, 2023

So do humans, even smart ones. I've lost count of how many times I've heard people say "I could care less" about something they don't care about.

umanwizard · on May 13, 2023

That’s a stock phrase in colloquial American English. Using it isn’t really evidence that someone “struggles with the concept of negation”.

petepete · on May 13, 2023

An incorrect stock phrase that means the exact opposite of what's been said. I don't think it's unreasonable to say people who get things entirely wrong are struggling.

umanwizard · on May 13, 2023

The whole point of stock phrases, idioms, expressions, etc. is that their meaning is “non-compositional”; i.e., they mean something different from the literal combination of words that make them up. Including even extreme cases where they can mean the opposite, like this one. That doesn’t mean they’re “wrong”, it’s just a quirk of human language.

petepete · on May 13, 2023

It does. "I couldn't care less" is a common term and "I could care less" is a misunderstanding.

I'd hope more people would realise they're saying they _do care_ about something they're claiming to not care about. It sounds stupid to a native English speaker.

umanwizard · on May 13, 2023

You’re just repeating the same point over and over rather than actually giving any argument.

> It sounds stupid to a native English speaker

You might be a native English speaker, but you’re not a native speaker of American English.

petepete · on May 14, 2023

> not a native speaker of American English

I could care less.