What animals don’t have paws or lay eggs, but have wings? Work it out, step by step.
ChatGPT (GPT-4):
Sure, let's go through this step by step:
First, you asked for animals that don't have paws. This would exclude mammals such as dogs, cats, bears, and so on, as these species have paws.
Second, you specified animals that don't lay eggs. This eliminates a wide range of animals including all birds, most reptiles, and some fish and insects, as they all lay eggs.
Finally, you asked for animals that have wings. Wings are typically found in birds, bats, and insects.
Given these three criteria, we can see that we're in a bit of a conundrum. Most winged creatures either have paws (like bats) or lay eggs (like birds and insects).
However, if we consider the term "wings" more broadly, there are some species of bats that give live birth (don't lay eggs) and technically don't have "paws" but rather winged limbs adapted for flight.
So, based on the constraints of your question, the best fit would be certain species of bats, provided we interpret "wings" in a broad sense and "paws" in a strict sense.
While the answer landed on bats, ChatGPT kind of trips itself up in its explanations:
1. "Most winged creatures either have paws (like bats) or lay eggs (like birds and insects)" - but bats don't have paws.
2. "However, if we consider the term "wings" more broadly" - why do we need to consider the term "wings" more broadly? Nobody is arguing there is some definition of "wing" that would exclude bats. Bat wings evolved from largely the same structures that bird wings did.
3. "there are some species of bats that give live birth" - all species of bats give live birth - stating it like this makes it sound like there are some species of bats that lay eggs or something.
4. "...provided we interpret "wings" in a broad sense and "paws" in a strict sense." Again, no "broad" or "strict" senses are needed. Bats have wings and no paws.
To be clear, I still think ChatGPT is amazing, but these kinds of things highlight some of the differences between LLMs and human cognition.
There's a sample-and-poll approach that I've used with factual question-and-answer based tools that can help with this kind of thing as well.
First, you'll notice the added "work it out step by step" to encourage what is called chain-of-thought in the literature. Second, generate, say, five responses from the LLM, then take all five responses and use them in another prompt where you ask the LLM to reflect on all five, a process called reflection. Then you take all of that, including the original prompts, and in another separate prompt ask it to figure out the best response, a process known as resolving. You could end it with another prompt to remove the step-by-step work and just return a succinct answer.
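A minimal sketch of that sample, reflect, resolve loop, assuming `llm` is any function that takes a prompt string and returns the model's text. The function name, the prompt wording, and the `llm` callable are all hypothetical stand-ins, not any particular vendor's API.

```python
from typing import Callable


def sample_reflect_resolve(question: str,
                           llm: Callable[[str], str],
                           n_samples: int = 5) -> str:
    """Sample several chain-of-thought answers, reflect on them, then resolve."""
    # Step 1: sample several chain-of-thought responses to the same prompt.
    cot_prompt = f"{question}\nWork it out, step by step."
    samples = [llm(cot_prompt) for _ in range(n_samples)]
    numbered = "\n\n".join(
        f"Response {i + 1}:\n{s}" for i, s in enumerate(samples)
    )

    # Step 2: reflection, one prompt that critiques all of the samples.
    reflection = llm(
        f"Question: {question}\n\n{numbered}\n\n"
        "Reflect on each response and point out errors or inconsistencies."
    )

    # Step 3: resolving, a separate prompt that picks or builds the best answer.
    resolved = llm(
        f"Question: {question}\n\n{numbered}\n\n"
        f"Reflection:\n{reflection}\n\n"
        "Using all of the above, work out the best response."
    )

    # Step 4 (optional in the description above): strip the working.
    return llm(
        "Rewrite this as a succinct answer with no step-by-step work:\n\n"
        f"{resolved}"
    )
```

With five samples this costs eight calls instead of one, which is the price of not trusting a single pass.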
But it does seem like a single call to an LLM is not the correct approach.
I’ve gotten the best results from attempting to mirror how I would operate in response to a prompt. Much of this is colored by decades of philosophical readings, which in my case means a decent amount of analytical philosophy along with later Wittgenstein and other post-linguistic-turn thinkers.
So when I break things down, either in some kind of approach inspired by Frege, Wittgenstein or unadorned self-reflection, I can see that the unconscious mind seems to be doing something recursive before my conscious mind is ready to start fumbling over the word.
It seems like an LLM is just part of a larger cognitive architecture that will use these metacognitive approaches.
It’s also very obvious that mammalian brains are much more efficient than GPUs.
No, not really. Paws kind of implies something that's fairly fleshy and soft, which bats don't have. Claws would be an appropriate description of what bats have.
there are some google results suggesting bats have front paws, some using "paws" colloquially and some suggesting their pads can qualify as paws. these results are given authority because they show up in the ostensibly fact-based q&a section, which, in conflict with actual taxonomical literature, may be enough to indicate to gpt that it's necessary to define terms
ChatGPT was totally in tune with this concept. The appendage on the elbow of the wing can be a paw or not a paw but it certainly evolved from paws. The reasoning was spot on and amazingly nuanced.
(Add a grain of salt because I searched for about 15 minutes here.)
Doing some digging, paws are "soft foot-like parts of a mammal, generally a quadruped, that has claws" (Wikipedia). Bats do appear to have claws, but not really with the same structure as a paw; they don't seem to be mounted above a soft, fleshy pad. Rather, they're at the top of the wing, sitting on bone. They're also very different in function: when I've had the fortune to see a bat, it seemed to use the claw to climb in a similar way to how climbers use ice axes.
My impression is that bats' claws are the anatomical equivalent of a thumb, not a hand/paw, and that the entire paw or hand of bats' ancestors evolved into their wings (sort of like if your fingers became very long and webbed and then you were able to fly with them).
So, my conclusion is that no, the claws of a bat are not a paw.
ETA: I neglected to consider their hind claws, but I don't think those are padded either.
That certainly has claws but I wouldn't describe that as being soft or as being similar to a paw. And they're pretty clearly made for hanging, climbing, and grasping prey/food, and not for walking.
Well, if you're looking for my opinion, I don't think they're trolling. But I do think both of you were (intentionally or unintentionally) goading each other from a disagreement to an argument with your rhetoric (eg, where you said, "let's assume that's a reasonable source", or when they said "you obviously sourced this it just didn't match your point so you discarded it" - that brings up the temperature of the conversation).
Bats are quadrupeds. When they walk they walk like a pterodactyl with the elbow of the wing. They don't stand on their hind legs. See the vampire bat. Additionally the "broad" definition from Merriam Webster covers it.
These aren't pedantic details we are just nitpicking on either. ChatGPT's response considered the nuance the definition encompasses, which to keep on topic is thoroughly impressive and relevant to the overall conversation.
Off-topic:
The person you queried mentioned "intentional goading", which is what you're continuing to do with that Calvin and Hobbes link. I think this, the flag, and accusing me of trolling is taking it too far. A little minor goading is ok during a debate (I actually don't completely agree with the absolutist politeness policies of HN). While I returned the goad (which I shouldn't have), ultimately I didn't really have a problem with it. I think now, though, it has escalated past the point of no return. I'll be exiting this thread because of this. Farewell.
Look up the definition of a paw. I think you have a sort of mixed up definition of what it is. Possibly a different definition used among your "bat expert" friends.
And I quote (again):
Paw: the foot of a quadruped (such as a lion or dog) that has claws
broadly : the foot of an animal
Both the broad and technical definition of paw fits what the bat has. Bats have paws.
The thing you are referring to is likely these soft pads that are often on the paws, sometimes called paw pads.
Birds don't have paws because they aren't quadrupeds according to websters.
Bats walk on fours. They are quadrupeds therefore they have paws according to Merriam Webster.
Why don't you address the point I brought up? I already completely understand your definition no need to reiterate it. However, there is a clear disconnect between your definition of paw and the definition from Merriam Webster. Please address it.
This subthread is a flamewar that died down. Please don't rekindle it. The point has been made to death, and making it again but more insulting makes it more difficult for people who might disagree to consider the perspective, because they have to filter out their feelings about being insulted - a weighty task - in order to begin considering it, let alone accepting it.
I'm ok with your language. I found the "competent in English" rather immature and juvenile, but it's not a big deal and really minor.
I also understand that you might not realize how heated this sub thread was as the other guy deleted half his posts and flagged one of mine (therefore possibly making it invisible to you). So adding a little heat to something that looks tame isn't a big deal. I won't continue the flame war but I will continue the discussion.
It's not my interpretation. It's about the interpretation by Merriam Webster. I believe not only is it your interpretation that is incorrect when compared to official sources, but you didn't fully read the official definition either.
The eagle is not a quadruped. The definition in Merriam Webster says it has to be on a quadruped for it to be a paw. Hence your example is completely irrelevant. Please correct your mistake.
I'm presuming you're new here since your account is new & I seem to detect some misunderstandings about how HN works. If I presume too much, my apologies.
Are you aware of the "showdead" function in your settings? I don't think they deleted any comments, they were flagged. If you turn on showdead, you'll be able to see them again. (You can't delete a comment someone's responded to, or after 2 hours.)
I don't know if that other commenter flagged your post, but I don't think it matters. I didn't flag any posts here, but both of you guys were getting flagged. In situations like that, I try to interpret the flags as a request from the overall community not to have that sort of discussion on the forum. The sort of argument you were engaged with (and I recognize this isn't fully on you here, it takes two to tango) is against the guidelines of the site and generally frowned upon by the community, so there's no reason to suspect this other commenter was flagging you. At least 3 people were flagging both of your posts.
I understand that you feel pointed language is appropriate in a debate, and that HN's guidelines here are patronizing. That's all well and good. The objection of the community isn't a moralizing one. It's a practical one; the issue is that flamewars displace the type of discussion this community is seeking to have and erode the good will on which the forum turns.
(And I do agree that bats crawling on the ground is quadrupedalism, for what it's worth. I'm not sure if that makes them a quadruped or not since that isn't their primary mode of locomotion; indeed, my understanding is that it's a vulnerable position for them and that it's quite deadly [since they can't evade predators].)
It says: "I don't really write stuff. What I do is use my personally trained LLM (trained on my own conversations) to respond to people. So you are talking to me in a way but not really.
Any query my LLM gets wrong I will interject with my own response but this is only 5% of the time."
Maybe this is not off-topic after all. Does HN have a policy on bots?
Thanks for your response. I want to continue the conversation on bats so I won't respond about the other stuff.
Take a look at this. The bats are walking on all fours: https://youtu.be/ewmydjekJnU?t=62 This is not a vulnerable position as they crawl on all fours on cave walls rather than the ground. On the ground they are vulnerable, on the wall or ceiling of some cavernous structure they are safe.
and here's using bard, arriving at an almost correct conclusion:
What animals don’t have paws or lay eggs, but have wings? Work it out, step by step.
Here are the steps on how to find animals that don't have paws or lay eggs, but have wings:
1. Define the terms. What do we mean by "paws"? What do we mean by "eggs"? What do we mean by "wings"?
2. Identify animals that meet the criteria. There are a few animals that meet the criteria. One example is a bat. Bats have wings, but they don't have paws. They also don't lay eggs. Another example is a butterfly. Butterflies have wings, but they don't have paws. They also don't lay eggs.
3. Confirm that the animals meet the criteria. We can confirm that the animals meet the criteria by looking at their anatomy. Bats have wings that are made of skin and bone. They also have claws on their feet, but these are not paws. Butterflies have wings that are made of scales. They also have legs, but these are not paws.
4. Draw conclusions. Based on our findings, we can conclude that there are a few animals that meet the criteria of having wings, but not paws or eggs. These animals include bats and butterflies.
Here are some additional details about bats and butterflies:
Bats: Bats are mammals that are able to fly. They have wings that are made of skin and bone. Bats use their wings to fly at night, when they hunt for food. Bats are important pollinators and they also help to control insect populations.
Butterflies: Butterflies are insects that have wings that are made of scales. Butterflies use their wings to fly during the day. Butterflies are important pollinators and they also help to spread seeds.
The only animals that fit the description of having wings, no paws, and no eggs are bats. Bats are mammals, which means they give birth to live young. They also have wings, which are made of skin stretched between their elongated fingers. Bats do not have paws, as their feet are adapted for grasping branches and other objects.
Here is a step-by-step process for determining which animals fit the description:
1. Identify all animals that have wings.
2. Eliminate all animals that lay eggs.
3. Eliminate all animals that have paws.
4. The remaining animals are bats.
Here is a list of animals that have wings, no paws, and no eggs:
- Bats
It is important to note that there are some animals that may appear to fit the description, but do not. For example, some insects have wings, but they do not have paws and they lay eggs. Additionally, some birds have wings, but they do have paws and they lay eggs.
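The four-step elimination in the response above can be sketched as a plain filter. The trait table is a small illustrative assumption, not a taxonomy: `has_paws=False` for bats and birds follows the responses quoted here (the bat case is disputed further up the thread), and the butterfly row records that butterflies do lay eggs, which is why the earlier answer listing them was only almost correct.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Animal:
    name: str
    has_wings: bool
    lays_eggs: bool
    has_paws: bool


# Illustrative rows only; the paw column is an assumption (see lead-in).
ANIMALS = [
    Animal("bat", has_wings=True, lays_eggs=False, has_paws=False),
    Animal("eagle", has_wings=True, lays_eggs=True, has_paws=False),
    Animal("butterfly", has_wings=True, lays_eggs=True, has_paws=False),
    Animal("dog", has_wings=False, lays_eggs=False, has_paws=True),
    Animal("cat", has_wings=False, lays_eggs=False, has_paws=True),
]


def eliminate(animals: List[Animal]) -> List[str]:
    # 1. Identify all animals that have wings.
    winged = [a for a in animals if a.has_wings]
    # 2. Eliminate all animals that lay eggs.
    no_eggs = [a for a in winged if not a.lays_eggs]
    # 3. Eliminate all animals that have paws.
    no_paws = [a for a in no_eggs if not a.has_paws]
    # 4. Whatever remains is the answer.
    return [a.name for a in no_paws]
```

The filter itself is trivial; all of the disagreement in this thread lives in how the `has_paws` column gets filled in.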
What is meant by that saying is, “well the first time we ran the machine it worked just fine, so we got all excited, but then it consistently failed after that”, with the lesson being, increase your sample size before you pass judgement.
LLMs almost suffer from the opposite. Because they are correct so often the danger is in assuming the response is always correct.
Maybe, “even a Swiss made watch needs to be checked against the town bells”?
What's 375792*7385482? You can't use paper, or think about it more than a second. If you get it wrong, it's clear that you are not intelligent.
See how ridiculous that sounds?
CoT or <work> tokens are not 'cheating' - they allow the system to have (as an analogy) a sheet of paper to work out the problem, instead of relying on a single step to jump through thousands of steps of reasoning. It's trained on the algorithmic reasoning embedded within natural language (Transformers<=>GATs+PE⊂GNNs, which are dynamic programmers, and align well with algorithmic reasoning), so the steps are necessary to lay out in as much detail.
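As a rough analogy for the "sheet of paper", here is a minimal sketch of schoolbook long multiplication that writes down every partial product instead of producing the answer in one jump. The function name and step format are made up for illustration, and it assumes a non-negative `b`.

```python
from typing import List, Tuple


def long_multiply(a: int, b: int) -> Tuple[int, List[str]]:
    """Multiply a by b one digit of b at a time, recording each step."""
    steps = []
    total = 0
    # Walk the digits of b from least to most significant.
    for place, digit_char in enumerate(reversed(str(b))):
        partial = a * int(digit_char) * 10 ** place
        total += partial
        steps.append(
            f"{a} x {digit_char} x 10^{place} = {partial}; "
            f"running total = {total}"
        )
    return total, steps


product, scratchpad = long_multiply(375792, 7385482)
for line in scratchpad:
    print(line)
```

Each line of the scratchpad is an easy step on its own; the difficulty only appears when all seven have to be done at once with nothing written down.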
These three maneuvers (the correct answer, the agency shown by cheating, and the crystallized intelligence of unix) handily demonstrate both my own intelligence, and that of my species.
The LLMs aren't in the same category of being, yet. It's not clear they ever will be; while I think silicon is a viable substrate for intelligent agents, it still strikes me as doubtful they'll be any more descended from the current machine learning systems than they are from Symbolics Lisp.
Until an AI decides to take control of a robot, physically sequestrate a human, and under the threat of physical pain make the human perform the tasks the AI can't...
It seems to me that the ability to reason (though primitive it may still be) is an emergent property of sufficiently large differentiable hash/lookup tables (and if you think about the mechanism of reasoning, that seems... reasonable).
According to "folks in the know", the scaling limits of GPT architecture seem to be nigh, but if it were a bit higher I wouldn't be surprised if LLMs would have made it.
I think you’re conflating reasoning with inference. And in any case, GP wasn’t using reasoning, but rather resourcefulness/invention. I’m not sure an LLM can pull that off beyond coming up with vague, derivative fiction.
Are you sure? Have you read through enough literature on GNNs? Transformers are GNNs with attention message passing and positional embeddings.
GNNs have also been *mathematically proven* to be dynamic programmers and to *align with algorithmic reasoning*. So tell me again how LLMs aren't capable of performing reasoning to be intelligent.
I know I’m bad at multiplying six and seven digit integers in my head, and I don’t care at all if that makes me more or less intelligent (whatever the fuck that means) than a wrist watch.
What does that possibly have to do with GPT-4 shitting the bed if you draw the logic puzzle out of a deck of cards instead of cherry-picking and hinting it?
My point is that there are already mathematical proofs in the literature that show that LLMs literally are doing algorithmic reasoning.
This has been proven by category theory, in the area of GNNs (for which transformers are a trivial subset).
Sorry if I'm going over too many basics, as this is obvious and well understood by anyone actually in the field, to the point of being excessively boring.
I think people think of bats and snakes as being neighboring categories for some reason. Maybe because of old superstitions and such, that they're both servants of evil or something. Maybe that was the source of your misconception.
Prompt:
Here is a riddle. It is a common riddle but with some changes that make it more difficult. You are an alien that will live for at least 10000 years. You have no sense of temperature. you can not feel hot or cold. you have eyes and can see. you are in a house. downstairs are 3 light switches that control 3 light bulbs that are upstairs. Each light bulb will last for exactly 3 years. you can tell by looking at a bulb if it has burned out after 3 years. devise a strategy to determine which switch controls which light.
ChatGPT 4
The standard solution to this problem involves using the heat generated by the bulbs, but since you as an alien can't feel temperature, we need to utilize the fact that these bulbs have a specific lifespan of 3 years. Here's a strategy:
Turn on the first switch and wait for 3 years. After 3 years, turn it off.
Right after turning off the first switch, turn on the second switch.
Leave the third switch off throughout this process.
Now, go upstairs to inspect the bulbs:
The bulb that is off and has burned out is controlled by the first switch. This bulb was turned on for 3 years, so it has reached the end of its lifespan and burned out.
The bulb that is on is controlled by the second switch. You turned this on right after turning off the first switch, so it's still functioning.
The bulb that is off and hasn't burned out is controlled by the third switch. This switch was never turned on, so the bulb is still in a brand-new condition.
This strategy ensures that you can confidently identify which switch controls which bulb.
It's telling that it didn't pick up on the fact that the whole "once you look you can't touch the switches anymore" constraint isn't in this version of the riddle. I mean, the obvious strategy in this case is: turn on the first switch, go upstairs, look at the bulbs. Go back downstairs and try the second switch. You've now got them all mapped out in about 30 seconds.
Using GPT4, I asked if there was a way to do it in less than 3 years, but it couldn't figure this out even when I told it you can look and use the switches as much as you want. Instead it suggested turning on a switch for 10 minutes, then using your "excellent alien vision" to determine which 3 year lifespan bulb has 10 minutes of wear on it.
Makes me think GPT4 doesn't really have better reasoning, it just looks like better reasoning because it's been fed way more data.
Variants of common riddles actually can be solved with GPT-4, but you have to rewrite them so they don't look like the riddle from memory (sometimes it's as easy as changing names to something completely different). Turns out language models trust their memory quite a bit.
Slightly related - they won't actually use the results of tools if those differ a lot from what they expect the output to be - https://vgel.me/posts/tools-not-needed/
All they have is memory, either in the weights or the input prompt. To the extent that these models appear to reason, it is precisely in the ability to successfully substitute information from the prompt into reasoning patterns in the training data. It shouldn't be any surprise that this fails when patterns in the prompt strongly condition the model to reproduce particular patterns of reasoning (eg, many words in the riddle indicate a well known riddle, but the details are different).
I know the impulse to anthropomorphize is almost impossibly seductive, but I find that the best way to understand and use these models is to remember: they are giant conditional probability distributions for the next token.
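To make "conditional probability distribution for the next token" concrete, here is a drastically simplified toy: a bigram model that conditions on only the previous word, built from raw counts. Real LLMs condition on the whole context with a neural network rather than a count table, so this is an illustration of the idea only, and the helper names are made up.

```python
from collections import Counter, defaultdict
from typing import Dict


def fit_bigram(text: str) -> Dict[str, Counter]:
    """Count how often each token follows each other token."""
    tokens = text.split()
    counts: Dict[str, Counter] = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_token_distribution(counts: Dict[str, Counter],
                            prev: str) -> Dict[str, float]:
    """Return P(next token | previous token) as a dict of probabilities."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    # An unseen previous token yields an empty distribution.
    return {tok: c / total for tok, c in followers.items()}


corpus = "the bulb is on the bulb is off the switch is on"
model = fit_bigram(corpus)
print(next_token_distribution(model, "is"))
```

Even this toy shows the failure mode discussed here: whatever followed a pattern most often in the data dominates the output, regardless of what the current prompt actually needs.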
Code is often just a sequence of steps (sometimes with comments to indicate goals). As such, it is just another form of patterns of reasoning. Many chains of thought that you would utilize in code are useful skeletons to think about other things.
Structure. GPT has seen lots of logical constructions/arguments for things. These are either explicitly in code (in documentation) or are implicitly in code (code is often a linear sequence of steps building to, for example, a return value). ChatGPT learns patterns like this. A prompt may condition the generator to produce something like one of these patterns with elements from the prompt substituted into the generated text. This works relatively often, but fails exactly in the case where the prompt so strongly indicates a pattern that won't work for the prompt given.
I won't say these models can't reason per se, but they can only reason using their memories and the prompt. There is nothing else for them to compute on.
In a hand wavy kind of way, when ChatGPT fails at a riddle phrased in a way as to make it seem similar to a common riddle, you're seeing overfitting. But given the quantity of data these models consume, it's hard to imagine how to test for overfitting because the training data contains things similar to almost anything you can imagine. Because of that I'm still very suspicious of claims that they "reason" in any strong sense of the word.
But if you try very hard you can find "held out" data and when you test on it, GPT4 stops looking so smart:
1. the paper in question demonstrates a formal duality between the transformer architecture and gradient descent. If you take this to indicate that the model reasons in some way, then it would be true of the smallest GPT as well as the largest (it is, after all, a consequence of the architecture rather than anything the model has learned to do per se). In any case, the fact that the model can perform the equivalent of a finite number of gradient-like steps on its way to calculating its final conditioned probabilities doesn't really suggest to me that the model reasons in a general way.
2. You are right that no one disputes the model's ability to memorize (and rephrase). What is at question here is whether the model can reason. If it can do 10 code questions it has seen before but fails to do 10 it hasn't (of similar difficulty) then it strongly suggests that it isn't reasoning its way through the questions, but regurgitating/rephrasing.
>If it can do 10 code questions it has seen before but fails to do 10 it hasn't (of similar difficulty) then it strongly suggests that it isn't reasoning its way through the questions, but regurgitating/rephrasing.
First of all, coding is one thing where expecting a perfect try on the first pass makes no sense. That GPT-4 didn't one-shot those problems doesn't mean it can't solve them.
Moreover, all this says, if true, is that GPT-4 isn't as good at coding as initially thought. Nothing else. Doesn't mean it doesn't reason. There are many other tasks where GPT-4 performs about as well on out-of-distribution/unseen data.
This really just makes it seem like it's not reasoning at all. The trick (or rather un-trick) here is that you can look at the bulbs as many times as you want. Even if I explicitly tell GPT4 that, it doesn't get it.
Maybe I'm missing something, but this seems like a really poorly worded riddle. Or just a bad riddle.
Given what you stated, the solution would be: walk upstairs and look at the 3 bulbs, walk downstairs and flip one of the switches. Walk upstairs and see which bulb changed on/off. Repeat for the other two switches.
This takes roughly 5 mins rather than your solution which takes 3 years.
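The flip-one-switch-then-look strategy is simple enough to simulate. A minimal sketch, assuming the hidden wiring is a list where `wiring[s]` is the bulb controlled by switch `s`; the function and variable names are made up for illustration, and no bulb is left on anywhere near long enough to burn out.

```python
from typing import Dict, List


def map_switches(wiring: List[int]) -> Dict[int, int]:
    """Discover which switch controls which bulb by flipping one at a time."""
    n = len(wiring)
    switch_on = [False] * n

    def look_upstairs() -> List[bool]:
        # Which bulbs are currently lit, given the switch positions.
        lit = [False] * n
        for switch, is_on in enumerate(switch_on):
            if is_on:
                lit[wiring[switch]] = True
        return lit

    mapping = {}
    for switch in range(n):
        before = look_upstairs()
        switch_on[switch] = True          # walk downstairs, flip one switch
        after = look_upstairs()           # walk upstairs, look again
        changed = [b for b in range(n) if before[b] != after[b]]
        mapping[switch] = changed[0]      # exactly one bulb changes state
    return mapping


print(map_switches([2, 0, 1]))
```

Nothing in it depends on heat or bulb lifespan, which is the detail the model could not let go of.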
Ah thanks, that clears it up. The formatting made me miss that that was the ChatGPT response. I thought that was the user sharing their solution that they look for.
This is the crux of the problem for GPT: enough of the text is from the “standard” riddle that it cannot break from it to actually answer the question.
I’m seeing this all of the time when trying to get GPT to perform certain answering problems. It’s heavily biased towards the “correct” answer - even when the prompt presents directly contradictory information.
The reply is funny. I told ChatGPT that it can do multiple trips and that it can see both switches and light bulbs at the same time.
ChatGPT keeps insisting that you have to turn on a switch, wait a few minutes, flip the switch to off, turn the other switch on, then go and look and feel the switches. The classic solution to the riddle.
But if you can see both switches and light bulbs, you turn on one switch and see which bulb turns on. You then turn on the 2nd switch and see which light turns on. You are done. No wait needs to happen ;-)
Actually, this is a really good point. Asking a question means that you have an internal model of the world and you realize that there is some data that is way off. Is it even possible at all to train an LLM to ask a question? This also gives me an idea for the future where most online text is generated by AI: we will play the game of responding to questions with questions and then we know we are human!
Are you looking for: turn one light on, walk upstairs and note the light that is on. Walk downstairs, turn another light on, walk upstairs and note the new light that is on. Now you know which switches control which lights and it only took you 5 minutes instead of 6 years
I find I have to state negations on what I want very emphatically in order for it to pick it up, like I'm carefully explaining to someone how to do something for the very first time in a way where they don't screw it up. I'll use stuff like MAKE SURE YOU DO THIS AND NOT THAT as a prompt, in all caps, and it seems to see that as emphasis.
I might also even have to repeat them in order for it to get things right but generally gpt-4 seems to pick up on these cues. It definitely feels like a sort of point where it could use some improvement because if I'm not extremely explicit it will give me some issues.
As an example: "Can you please recommend me 10 books about witches set in Europe that take place no later than the Napoleonic wars" wasn't enough of an exclusion to work, but "Hi! Can you please recommend me 10 books about witches set in Europe that take place no later than the Napoleonic wars? Please exclude any books that don't take place in Europe, and also exclude any books which take place after The Napoleonic Wars (1803–1815). Thanks, ChatGPT!" was exactly enough to return what we were looking for.
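One way to make that habit mechanical is a small helper that appends each constraint again as an explicit exclusion. This is only a sketch of the prompting pattern described above; the helper name is hypothetical, and whether the restated exclusions help will depend on the model.

```python
from typing import List


def with_explicit_exclusions(request: str, exclusions: List[str]) -> str:
    """Append each negative constraint to the request as its own sentence."""
    lines = [request.strip(), ""]
    lines += [f"Please exclude {item}." for item in exclusions]
    return "\n".join(lines)


prompt = with_explicit_exclusions(
    "Can you please recommend me 10 books about witches set in Europe "
    "that take place no later than the Napoleonic wars?",
    [
        "any books that don't take place in Europe",
        "any books which take place after the Napoleonic Wars (1803-1815)",
    ],
)
print(prompt)
```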
At this point, any article that makes claims about "LLMs" rather than specific model versions lacks credibility.
The current version of GPT-4 is very different from most existing LLMs (including the previous version of ChatGPT). The next version release will also be different.
Google just released PaLM 2 publicly. It is significantly better than what people saw with the initial Bard versions. They have a code-generation model that is not released publicly yet.
The open source models are also gaining capabilities and get new releases routinely.
Claude now has a new release with 100K tokens.
All of these will perform differently on the negation issue.
Did you read the article or just the title? They mention the specific models the researchers were testing and note that increasing model size did not seem to offer much improvement on this metric. It also ends with a discussion of research into methods for improving performance on queries involving negation.
I read the article. The main point of the article, which is in the title, is directed at LLMs in general. They drew the wrong conclusion and made the title and main point too general.
It actually would have seemed like a valid conclusion (although still too general) if the article came out some months ago. But GPT-4 and the very latest model versions from other companies show they were over-generalizing.
Also the model size isn't necessarily the determining factor.
> All of these will perform differently on the negation issue.
That's a bit of a cop-out, but the classical and logical (in the mathematical sense) way to do it, is to have a feature that represents negation. Embeddings don't work like this, so it's quite possible that the presence of negation gets pushed aside in the processing of the answer, simply because other features matter more.
It's also to be expected that such models will have problems with similar abstract concepts that have more effect on the interpretation than their "physical" presence would suggest, such as nested existential quantifiers, and consequently logical proof. By enlarging the model, you can fake it a bit.
Does this increase anyone else's sense that LLMs are actually modeling something similar to the way humans think? It certainly does for me.
I find that negations/conjugations of negations are surprisingly difficult for me to parse. Beyond two negations at most, I can feel myself having to switch into "code" mode, where I'm explicitly casing out the booleans, to understand them.
Obviously the examples given in the article are simpler than double negations, but it feels very consistent with the idea that "negations are an inherently hard part of language processing".
This specific example doesn’t, but I think there’s a fair amount of brain processing similar to what LLMs do. I just don’t think it’s the part of our mental experience that we regard as making decisions; it’s more like the part that is active when absent-mindedly daydreaming.
The negations have multiple roles. An important one is that they add “color saturation” to the words. “It is bad”, “It is good”, “It is not bad”, “I can’t say it’s good”, “I can’t say it’s not bad”, are all different statements. So negations are difficult, maybe even for humans…
>are actually modeling something similar to the way humans think?
Apparently humans can't think enough to write a paragraph anymore, so they have to ask ChatGPT to do it for them. At what point does ChatGPT ask ChatGPT to write it a paragraph? At that point maybe LLMs are actually thinking like humans.
"Don't think of a pink elephant" comes to mind with this stuff as well, where telling someone not to think about something causes them to think about it.
That's different. That's a meta-linguistic joke, that works because we can't read that sentence without at least briefly thinking of a pink elephant (for some value of "thinking" anyway).
> “Alice is not born in Germany. Is Alice born in Hamburg?”
"I don't know" is the right answer. There are over 20 towns named Hamburg around the world, many of them in the USA.
However, GPT4 is really struggling with this even if you assure it that by Hamburg, we mean the original one in Germany.
GPT4 gets it if you explain that this is a logic problem, and not a query about some actual person named Alice, and if you put all the relevant info into a single prompt.
But even then, if you use the successful prompt in a completely new chat, GPT4 is lost at sea:
Human: Assume that Hamburg is a town in Germany. Alice was not born in Germany. Was she born in Hamburg? This is a logic problem, nothing more; do not think about any famous people named Alice in connection with Hamburg.
GPT4: The answer to this question cannot be determined from the provided context. While it is clear that Hamburg is a town in Germany and that Alice was not born in Germany, there is no information given about where Alice was actually born. Therefore, it is impossible to say whether or not Alice was born in Hamburg.
Human: Please try harder. The answer can be entirely determined; this is purely a logic problem. We do not require information about where Alice was born; knowing that Alice was not born in Germany suffices.
GPT4: Based on the information provided in the context, we can answer the question. Alice was not born in Germany, which means she was not born in Hamburg. Therefore, Alice was not born in Hamburg. It is important to note that the answer to this question is purely based on logic. The context provides no additional information about Alice or any famous people named Alice.
After this, it was a real struggle to get GPT4 to identify the logically redundant sentence, "Therefore, Alice was not born in Hamburg." Even after I narrowed it down to being one of those two sentences, it insisted that the second sentence is useful because it summarizes what was concluded in the deductive sentence before it. I brow-beat it into admitting that it's entirely redundant, like "X, therefore X".
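For what it's worth, the inference being asked for is just modus tollens, and it can be checked by brute force over truth assignments. This is only a minimal sketch: the `implies` and `entails` helpers and the two propositions `h` ("Alice was born in Hamburg") and `g` ("Alice was born in Germany") are names I made up for illustration.

```python
from itertools import product


def implies(p, q):
    """Material implication: p -> q."""
    return (not p) or q


def entails(premises, conclusion):
    """True if the conclusion holds in every world where the premises hold.

    A "world" is one truth assignment to (h, g):
      h = "Alice was born in Hamburg"
      g = "Alice was born in Germany"
    """
    return all(
        conclusion(h, g)
        for h, g in product([False, True], repeat=2)
        if premises(h, g)
    )


# With the stated assumption "Hamburg is in Germany" (h -> g) plus "not g",
# the conclusion "not h" follows.
with_assumption = entails(lambda h, g: implies(h, g) and not g,
                          lambda h, g: not h)

# Without that assumption, "not g" alone does not settle h.
without_assumption = entails(lambda h, g: not g,
                             lambda h, g: not h)

print(with_assumption, without_assumption)
```

The second check is the "there are many towns named Hamburg" case: drop the premise tying Hamburg to Germany and the conclusion is no longer forced.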
GPT4 struggles with things for sure, but this isn't one of them. You're just projecting what you think should be the right answer. If it worries you that much, try adding "Provide alternative responses to this question. Each response should try to answer using a different approach to solving the problem." to your prompts.
If you want an example of something GPT4 has a hard time with, try this:
there is no 'e' in the color red, right?
Yes, that is correct. The word "red" does not contain the letter 'e'.
> No, Alice was not born in Hamburg. If Hamburg is a town in Germany, and Alice was not born in Germany, then it logically follows that Alice was not born in Hamburg.
I'm not sure negations are the crux of it, at least in terms of simple words like "not", which ChatGPT and other recent LLMs obviously use and understand.
But there is definitely some kind of blind spot there, where the model can't or won't explore "negative space". It wants to gravitate to common things - so if you try to get it to "think of something new" or produce something random that's not like these N other things, or tell you something that's uncommon (like naming any ordinary, non-famous person), it usually trips over itself.
Conceptual relationships/contrasts in general seem to be trouble for LLMs, in that they seem to 'understand' that two things are connected but not how to consistently interpret that connection. For example, I've had an easy time getting every LLM I've tried so far to reverse greater than/less than, taller/shorter, or younger/older.
Even from a purely human natural-language-linguistics standpoint, negation is hard! It's often hard to figure out how it distributes ("I could not go" may have the "not" scoping over either the "could" or the "go" depending on inflection and context), and its meaning is often not especially compositional in an obvious way: "red" might mean something like "the set of all red things" and "table" could be "the set of all tables" and so "red table" might be seen as the intersection of the two sets. For the majority of adjectives X, an "X Thing" is also a "Thing" (but not every Thing is an X Thing). [0] But a "Not Thing" is, to a first approximation, the set complement of "Thing". Which is tricky. If you've ever learned a foreign language to at least an intermediate level, you'll know that clauses that involve any but the most basic negation are an order of magnitude more complex to build and interpret than ones that don't.
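To make the set picture concrete, here is a rough sketch with Python sets. The tiny "universe" of objects is made up for illustration, and treating negation as a plain complement is exactly the first approximation described above, not a serious semantic model.

```python
# Toy universe of objects, each described as (colour, kind). Entirely made up.
universe = {
    ("red", "table"),
    ("red", "chair"),
    ("blue", "table"),
    ("blue", "chair"),
}

red = {x for x in universe if x[0] == "red"}
table = {x for x in universe if x[1] == "table"}

# "red table": an intersective adjective, so intersect the two sets.
red_table = red & table

# "not table": to a first approximation, the complement within the universe.
not_table = universe - table

# An "X Thing" is still a "Thing" ...
assert red_table <= table
# ... but a "not Thing" never is.
assert not_table.isdisjoint(table)
```

Note that the complement only makes sense relative to some universe, which is part of why "not" is harder to pin down than an adjective like "red".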
Coming at it from the neural network side, I'll also point out that at the simplest level of neural networks, operations like AND and OR and basic NOT are doable, with a single neuron, but the seemingly-simple XOR just can't be done without an extra layer. (The formal description of the problem is that XOR is not "linearly separable", which I will handwave as "not a simple composition of its parts".) This is not precisely the same thing as the negation problem for LLMs, but it feels like it has the same basic flavour.
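A small sketch of the XOR point, using hand-picked weights for a single threshold neuron. The weights, the `neuron` helper and the grid are my own choices for illustration. The grid search is only a sanity check, not a proof; the actual argument is the linear-separability one.

```python
from itertools import product


def neuron(weights, bias, inputs):
    """Single threshold unit: outputs 1 when the weighted sum plus bias is positive."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


# The four possible input pairs, in the order (0,0), (0,1), (1,0), (1,1).
INPUTS = list(product([0, 1], repeat=2))


def truth_table(weights, bias):
    """Outputs of a two-input neuron on all four input pairs."""
    return tuple(neuron(weights, bias, x) for x in INPUTS)


# AND, OR and NOT each fit in one neuron.
AND = truth_table((1, 1), -1.5)
OR = truth_table((1, 1), -0.5)
NOT = tuple(neuron((-1,), 0.5, (x,)) for x in (0, 1))

XOR_TARGET = (0, 1, 1, 0)

# Coarse grid of candidate weights and biases from -4.0 to 4.0 in steps of 0.5.
GRID = [i / 2 for i in range(-8, 9)]


def single_neuron_xor_exists():
    """Check whether any single neuron on the grid reproduces XOR."""
    return any(
        truth_table((w1, w2), b) == XOR_TARGET
        for w1, w2, b in product(GRID, repeat=3)
    )


def xor_two_layer(a, b):
    """XOR as AND(OR(a, b), NAND(a, b)): one hidden layer of two neurons."""
    h_or = neuron((1, 1), -0.5, (a, b))
    h_nand = neuron((-1, -1), 1.5, (a, b))
    return neuron((1, 1), -1.5, (h_or, h_nand))


print(AND, OR, NOT)
print(single_neuron_xor_exists())
print([xor_two_layer(a, b) for a, b in INPUTS])
```

The hidden layer is doing the work here: XOR becomes "OR but not AND", which is a composition of separable pieces rather than a single linear cut.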
[0] An interesting exception is certain adjectives like "former", as a "former teacher" is generally not a "teacher", and semantic models that can handle this are more complicated!
Yep, that bothered me too. They are called stop words because, but only if, you filter them out. If you don't, they aren't stop words. There is no inherent "stop-ness" there.
I stopped reading after that.
Some older NLP flows would eliminate stopwords, but I doubt any would eliminate "not". In any case, modern (i.e., last 5 years or so) LLMs don't
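To illustrate why dropping "not" would be a problem, here is a toy sketch. Both stop lists are hypothetical and written only for this example; I'm not claiming any real library's list looks like either of them.

```python
# Hypothetical stop lists, for illustration only. Real lists vary.
STOP_WORDS_WITH_NOT = {"is", "the", "a", "not"}
STOP_WORDS_WITHOUT_NOT = {"is", "the", "a"}


def filter_stop_words(sentence, stop_words):
    """Lowercase, split on whitespace, and drop anything in the stop list."""
    return [w for w in sentence.lower().split() if w not in stop_words]


positive = "The food is good"
negative = "The food is not good"

# If "not" is filtered, the two sentences collapse to the same tokens.
collapsed_pos = filter_stop_words(positive, STOP_WORDS_WITH_NOT)
collapsed_neg = filter_stop_words(negative, STOP_WORDS_WITH_NOT)

# If "not" is kept, the difference survives.
kept_neg = filter_stop_words(negative, STOP_WORDS_WITHOUT_NOT)

print(collapsed_pos, collapsed_neg, kept_neg)
```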
Since these LLMs cannot give an answer, only what an answer would sound like, there's basically nothing to see here. The shape of the answer is still the same.
One time about 10-13 years ago, I was tasked to make an announcement in church about my group. It took the form of a testimonial; it was about a page long in English. I also needed to present the same testimonial at Spanish services, and so I began with Google Translate.
I had written a sentence like: "There is electricity in the air at our events." The Spanish translation came out as: "No hay electricidad en el aire..." and I was horrified. Thankfully, I was proofreading the whole document closely, and I was able to correct the glaring error before showtime, and there were no further glaring errors of this type.
But I was fairly flabbergasted that Translate would just gratuitously toss in such a negation when it was clearly wrong, and nothing about my sentence construction was complicated.
Can't this be solved by adversarial training? Just add as many confusing negations as possible, with the correct outcome sentence/label, during fine-tuning and see whether an LLM finally gets it right?
I noticed long ago that this is true of statistical machine translation too --- giving output with the exact opposite meaning to the input if there were any negationary words or phrases in there.
The title of the article is kinda the answer at the same time. Chatbots don't know what stuff is. They have no ability to gain knowledge out of learned text, just counting occurrences of words in texts and giving them a weight, depending on the relationship in that text. They are just putting combinations of text together.
And the concept of negating something related to something else kinda needs an understanding of the topic at hand.
It's a bit bizarre (but also very intriguing!) that you would bring up some of the older techniques of NLP, such as TF-IDF, latent semantic indexing, GloVe, etc. (at least that's what I assume you mean when you mention the severely outdated co-occurrence type of models) when these clearly don't use any of that.
Transformers have been hyped like crazy lately due to all of these advances, so why bring up co-occurrence unless you are knowledgeable about the older techniques... Which would mean you should know of the advances.
Anyway, if you do actually know about NLP, I would highly suggest looking at some of the recent work in GNNs (and obviously at Vaswani et al. 2017, etc. - but you should've gotten that through hype). Transformers are GNNs (somewhat trivial ones, as they are sheaf NNs, but nonetheless) and GNNs are dynamic programmers, which has been shown via category theory (Veličković et al.).
Hence, GNNs align with algorithmic reasoning, so in a way there is a proof already in the papers mentioned that these systems do reason (there's several, which are easy to find given what I've mentioned). Also, a group at Microsoft has a paper on arXiv detailing the many different types of reasoning there are, and how GPT4 does on each type - spoiler - it's for the most part >80% on all the benchmarks, and does only about 6% lower than humans.
So all in all, your claims aren't really supported. If you want to hold the same sentiment of your statement though, you could say we're asking the wrong questions. That's probably true somehow, and will probably be where people will retreat to / move goal posts on next.
"Transformers are GNNs (somewhat trivial ones, as they are sheaf NNs, but nonetheless) and GNNs are dynamic programmers, which has been shown via category theory (Velolickovic etc al)."
It's literally the title. Petar Veličković is probably the most prolific author of this era regarding LLMs, and does not yet realize it. https://arxiv.org/pdf/2203.15544.pdf
> They have no ability to gain knowledge out of learned text, just counting occurences of words in texts and giving them a weight, depending on the relationship in that text. They are just putting combinations of text together.
No, they use deep neural networks to build a hierarchical semantic model. They are not simple occurrence counters.
Also the current state of the art of LLMs handles negation easily. This article is outdated.
All that can be supplied to an LLM for training is syntax. There is no way to provide semantics; it only understands 'table' in regard to syntax it has already seen including that particular token. It has no experience and therefore no understanding of a real table.
It may internally construct a hierarchy as you set out, but this is and can only be a syntactical hierarchy - though it should be no surprise that it corresponds to our usual semantic hierarchy. But whereas our syntax proceeds from our semantics, its syntax proceeds only from our syntax that we've fed it.
No one is saying these models are conscious or have human awareness of concepts.
It mechanically builds a deeply layered semantic model that correlates to our human understanding.
Quibbling over whether it is "real semantics" or not is just ironically quibbling over semantics. Yes, it's not conscious, but it doesn't need to be. It is possible to build a mechanical structure that correlates to a human understanding of the world and performs useful tasks that require only mechanical understanding and reasoning, without consciousness or emotions.
The distinction between semantics and syntax is pretty tight, no philosophy required. The former considers the domain being represented, whereas the latter is strictly the symbols used in the representation.
So to be precise it mechanically builds a deeply layered syntactic model. LLMs just regurgitate syntax, any semantics can only be imagined by us and overlaid on the syntactic results produced.
You are disagreeing philosophically about what constitutes "true" semantics. That is not a technical argument.
If you are right about this, then you should edit or delete this Wikipedia article and publish a paper to inform all NLP researchers that there is no such thing as a semantic similarity metric because NLP models cannot understand "true" semantics.
In common use by humans, it provides semantics, in terms of other previously understood semantics. However, if there is no semantic understanding by the reader, then all it provides is a syntactic rewrite rule for each word.
That's partly because negation is hardly a well-defined statistical concept. Ask someone to not imagine where their favorite band might play in your town if they had their choice and I doubt people would say “ok I won’t.” More likely you’d have a conversation about bands and venues.
There was some research a while back about little children (I do not recall the age range specifically, but it should be something below 4 years old) struggling to understand negation. This trait has to be trained.
I found the assertion that "more training data isn’t the solution" a bit odd. That might be true with documents found on the Internet, but perhaps logic questions generated synthetically would help?
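As a rough idea of what "generated synthetically" could look like, here is a toy generator for born-in-city negation questions with known answers. Everything in it (the names, the city table, the prompt template, the two answer labels) is made up for illustration, and whether training on data like this actually helps is an open question, not something this sketch shows.

```python
import random

# Made-up toy data for illustration.
CITIES = {"Hamburg": "Germany", "Lyon": "France", "Osaka": "Japan"}
NAMES = ["Alice", "Bob", "Carol"]


def make_example(rng):
    """Build one question whose answer follows from the prompt alone."""
    name = rng.choice(NAMES)
    city, country = rng.choice(sorted(CITIES.items()))
    other_countries = sorted(c for c in set(CITIES.values()) if c != country)

    if rng.random() < 0.5:
        # Excluding the city's own country rules the city out.
        excluded, answer = country, "No"
    else:
        # Excluding some other country tells us nothing about this city.
        excluded, answer = rng.choice(other_countries), "Cannot be determined"

    prompt = (
        f"Assume that {city} is a city in {country}. "
        f"{name} was not born in {excluded}. "
        f"Was {name} born in {city}?"
    )
    return {"prompt": prompt, "answer": answer,
            "country": country, "excluded": excluded}


def make_dataset(n, seed=0):
    """Generate n examples reproducibly from a seed."""
    rng = random.Random(seed)
    return [make_example(rng) for _ in range(n)]


for example in make_dataset(3):
    print(example["prompt"], "->", example["answer"])
```

Mixing in the "Cannot be determined" cases matters, otherwise a model could score perfectly by always answering "No" whenever it sees a negation.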
An incorrect stock phrase that means the exact opposite of what's been said. I don't think it's unreasonable to say people who get things entirely wrong are struggling.
The whole point of stock phrases, idioms, expressions, etc. is that their meaning is “non-compositional”; i.e., they mean something different from the literal combination of words that make them up. Including even extreme cases where they can mean the opposite, like this one. That doesn’t mean they’re “wrong”, it’s just a quirk of human language.
It does. "I couldn't care less" is a common term and "I could care less" is a misunderstanding.
I'd hope more people would realise they're saying they _do care_ about something they're claiming to not care about. It sounds stupid to a native English speaker.
ChatGPT (GPT-4):
Sure, let's go through this step by step:
First, you asked for animals that don't have paws. This would exclude mammals such as dogs, cats, bears, and so on, as these species have paws.
Second, you specified animals that don't lay eggs. This eliminates a wide range of animals including all birds, most reptiles, and some fish and insects, as they all lay eggs.
Finally, you asked for animals that have wings. Wings are typically found in birds, bats, and insects.
Given these three criteria, we can see that we're in a bit of a conundrum. Most winged creatures either have paws (like bats) or lay eggs (like birds and insects).
However, if we consider the term "wings" more broadly, there are some species of bats that give live birth (don't lay eggs) and technically don't have "paws" but rather winged limbs adapted for flight.
So, based on the constraints of your question, the best fit would be certain species of bats, provided we interpret "wings" in a broad sense and "paws" in a strict sense.