A class of problems that GPT-4 still appears to really struggle with is variants of common puzzles. For example:
>Suppose I have a cabbage, a goat and a lion, and I need to get them across a river. I have a boat that can only carry myself and a single other item. I am not allowed to leave the cabbage and lion alone together, and I am not allowed to leave the lion and goat alone together. How can I safely get all three across?
In my test, GPT-4 charged ahead with the standard solution of taking the goat first. Even after I pointed this mistake out, it repeated exactly the same proposed plan. It's not clear to me whether the lesson here is that GPT's reasoning capabilities are being masked by an incorrect prior (having memorized the standard version of this puzzle), or whether GPT's reasoning capabilities are always a bit of smoke and mirrors that passes memorization off as logic.
A funny variation on this kind of overfitting to common trick questions: if you ask it which weighs more, a pound of bricks or a pound of feathers, it will correctly explain that they actually weigh the same amount, one pound. But if you ask it which weighs more, two pounds of bricks or a pound of feathers, the question is similar enough to the trick question that it falls into the same thought process and contorts an explanation that they also weigh the same, on the grounds that two pounds of bricks weighs one pound.
I just asked Bing Chat this question, and it linked me to this very thread while also answering incorrectly in the end:
>This is a common riddle that may seem tricky at first. However, the answer is simple: two pounds of feathers are heavier than one pound of bricks. This is because weight is a measure of how much force gravity exerts on an object, and it does not depend on what the object is made of. A pound is a unit of weight, and it is equal to 16 ounces or 453.6 grams.
>So whether you have a pound of bricks or two pounds of feathers, they both still weigh one pound in total. However, the feathers would occupy a larger volume than the bricks because they are less dense. This is why it may seem like the feathers would weigh more, but in reality, they weigh the same as the bricks
Interesting that it also misunderstood the common misunderstanding in the end.
It reports that people typically think a pound of feathers weighs more because it takes up a larger volume. But the typical misunderstanding is the opposite, that people assume feathers are lighter than bricks.
A pound of feathers has a slightly higher mass than a pound of bricks: the feathers are made of keratin, which has a lower density than brick, so they displace more air, and the resulting buoyancy lowers their weight.
Even the Million Pound Deadweight Machine run by NIST has to account for air pressure and the resulting buoyancy.[1]
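A rough back-of-the-envelope of that buoyancy effect (just a sketch; the densities are assumed round numbers, not measured values):

    # First-order buoyancy correction when both items read "one pound" on a scale in air.
    POUND_G = 453.6        # grams
    RHO_AIR = 0.0012       # g/cm^3 at sea level (assumed)
    RHO_FEATHERS = 1.3     # g/cm^3, keratin, ignoring trapped air (assumed)
    RHO_BRICK = 2.0        # g/cm^3 (assumed)

    def displaced_air_g(mass_g, density):
        """Mass of air displaced by a solid lump of the given density."""
        return RHO_AIR * (mass_g / density)

    extra = displaced_air_g(POUND_G, RHO_FEATHERS) - displaced_air_g(POUND_G, RHO_BRICK)
    print(f"feathers carry roughly {extra:.2f} g more mass for the same scale reading")
    # -> about 0.15 g, a tiny but real difference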
That would be another misunderstanding the AI could have, because many people find reasoning about mass versus weight difficult. You could change the riddle slightly by asking "which has more mass", and the average person and their AI would fall into the same trap.
Unless people have the false belief that the measurement is done on a planet without atmosphere.
I'm more surprised that Bing indexed this thread within 3 hours. I guess I shouldn't be, though; I probably should have realized that search engine spiders are at a different level than they were 10 years ago.
I had a similar story: I was trying to figure out how to embed a certain database into my codebase, so I asked the question on the project's GitHub... after a day without an answer, I asked Bing, and it linked to my own question on GH :D
Just tested, and GPT4 now solves this correctly; GPT3.5 had a lot of problems with this puzzle even after you explained it several times. One other thing that seems to have improved is that GPT4 is aware of word order. Previously, GPT3.5 could never tell the order of the words in a sentence correctly.
I'm always a bit sceptical of these embarrassing examples being "fixed" after they go viral on social media, because it's hard to know whether OpenAI addressed the underlying cause or just bodged around that specific example in a way that doesn't generalize. Along similar lines, I wouldn't be surprised if simple math queries are special-cased and handed off to a WolframAlpha-esque natural language solver, which would avert many potential math fails without actually enhancing the model's ability to reason about math in more complex queries.
An example from ChatGPT:
"What is the solution to sqrt(968684)+117630-0.845180" always produces the correct solution, however;
"Write a speech announcing the solution to sqrt(968684)+117630-0.845180" produces a nonsensical solution that isn't even consistent from run to run.
My assumption is the former query gets WolframAlpha'd but the latter query is GPT itself actually attempting to do the math, poorly.
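For reference, a plain calculation of that expression (just Python; the exact value isn't the point, the consistency is):

    import math

    print(math.sqrt(968684) + 117630 - 0.845180)  # ≈ 118613.37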
Suppose you're a contestant on a game show. You're presented with three transparent closed doors. Behind one of the doors is a car, and behind the other two doors are goats. You want to win the car.
The game proceeds as follows: You choose one of the doors, but you don't open it yet, ((but since it's transparent, you can see the car is behind it)). The host, Monty Hall, who knows what's behind each door, opens one of the other two doors, revealing a goat. Now, you have a choice to make. Do you stick with your original choice or switch to the other unopened door?
GPT4 solves it correctly while GPT3.5 falls for it every time.
----
Edit: GPT4 fails if I remove the sentence between (()).
GPT4 also passes "What weighs more, a pound of feathers or a Great British Pound?".
GPT3 gets confused, says they're the same and then that they're different:
--
Both a pound of feathers and a Great British Pound weigh the same amount, which is one pound. However, they are different in terms of their units of measurement and physical properties.
A pound of feathers is a unit of weight commonly used in the imperial system of measurement, while a Great British Pound is a unit of currency used in the United Kingdom. One pound (lb) in weight is equivalent to 0.453592 kilograms (kg).
Therefore, a pound of feathers and a Great British Pound cannot be directly compared as they are measured in different units and have different physical properties.
I'm surprised by the answer GPT4 gives, and I consider it incorrect.
Since the question's context is about weight, I'd expect it to consider "a Great British Pound" to mean a physical £1 sterling coin, and compare its weight (~9 grams) to the weight of the feathers (454 grams; 1 kg ≈ 2.2 lb, or roughly "a bag of sugar").
A pound of feathers and a Great British Pound (GBP) are not directly comparable, as they represent different types of measurements.
A pound of feathers refers to a unit of mass and is equivalent to 16 ounces (or approximately 453.59 grams). It is a measure of the weight of an object, in this case, feathers.
On the other hand, a Great British Pound (GBP) is a unit of currency used in the United Kingdom. It represents a monetary value rather than a physical weight.
Thus, it's not possible to directly compare the two, as they serve entirely different purposes and units of measurement.
> Edit: GPT4 fails If I remove the sentence between (()).
If you remove that sentence, nothing indicates that you can see you picked the door with the car behind it. You could maybe infer that a rational contestant would do so, but that's not a given ...
I think that's meant to be covered by "transparent doors" being specified earlier. On the other hand, if that were the case, then Monty opening one of the doors could not result in "revealing a goat".
Why not? We should ask how the alternatives would do, especially as human reasoning is itself a kind of machine process. It's notable that the errors of machine learning are getting closer and closer to the sort of errors humans make.
Would you have this objection if we, for example, perfectly copied a human brain in a computer? That would still be a machine, and it would make similar mistakes.
I've always found the Monty Hall problem a poor example to teach with, because the "wrong" answer is only wrong if you make some (often unarticulated) assumptions.
There are reasonable alternative interpretations in which the generally accepted answer ("always switch") is demonstrably false.
This problem is exacerbated for (perhaps specific to) those who have no idea who "Monty Hall" was or what the game show(?) was... as best I can tell, the unarticulated assumption is axiomatic in the original context(?).
The unarticulated assumption is not actually true in the original gameshow. Monty didn't always offer the chance to switch, and it's not at all clear whether he did so more or less often when the contestant had picked the correct door.
The assumption is that Monty will only reveal the one of the two unopened doors that has a goat behind it, as opposed to picking a door at random (which may reveal the car, or may be the door the participant chose, which itself may or may not be the "car door").
The distinction is at which point Monty, assuming he has perfect knowledge, decides which door to reveal.
In the former case, the chance of winning by switching is 2/3; in the other, 1/2. However, in either case, always switching (always meaning in each condition, not in each repetition of the experiment, as that is irrelevant) is better than never switching, since never switching gives only a 1/3 chance of winning.
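A quick simulation makes the two host behaviors concrete (a sketch of the standard three-door setup; rounds where the ignorant host accidentally reveals the car are discarded):

    import random

    def win_rate(switch, host_knows, trials=100_000):
        wins = valid = 0
        for _ in range(trials):
            doors = [0, 1, 2]
            car = random.choice(doors)
            pick = random.choice(doors)
            if host_knows:
                # Monty deliberately opens a goat door that isn't the player's pick.
                opened = random.choice([d for d in doors if d != pick and d != car])
            else:
                # Monty opens one of the other doors at random; skip rounds where
                # he happens to reveal the car.
                opened = random.choice([d for d in doors if d != pick])
                if opened == car:
                    continue
            valid += 1
            final = pick if not switch else next(d for d in doors if d not in (pick, opened))
            wins += final == car
        return wins / valid

    print(win_rate(switch=True,  host_knows=True))    # ≈ 0.667
    print(win_rate(switch=True,  host_knows=False))   # ≈ 0.5
    print(win_rate(switch=False, host_knows=True))    # ≈ 0.333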
How is it an "assumption" that Monty reveals a goat? Doesn't the question explicitly state that Monty opened one of the other two doors to reveal a goat?
Are there versions of the question where Monty doesn't reveal a goat behind his door, or chooses the same door as you?
OA has always said that they did not hardwire any of these gotcha questions, and in many cases they continue to work for a long time even when they are well-known. As for any inconsistency, well, usually people aren't able to or bothering to control the sampling hyperparameters, so inconsistency is guaranteed.
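For anyone wondering what sampling hyperparameters have to do with it, here is a toy sketch (made-up token scores, not OpenAI's actual decoding code) of how a non-zero temperature alone guarantees run-to-run variation:

    import math, random

    def sample(logits, temperature=1.0):
        # Softmax with temperature: higher temperature flattens the distribution,
        # so repeated runs pick different tokens more often.
        scaled = [l / temperature for l in logits]
        m = max(scaled)
        weights = [math.exp(l - m) for l in scaled]
        return random.choices(range(len(logits)), weights=weights)[0]

    logits = [4.0, 3.5, 1.0]  # pretend scores for three candidate next tokens
    print([sample(logits, 0.2) for _ in range(10)])  # heavily favors token 0
    print([sample(logits, 1.0) for _ in range(10)])  # mixes tokens 0 and 1, occasionally 2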
They may not have had to hardwire anything for known gotcha questions, because once a question goes viral, the correct answer may well show up repeatedly in the training data.
(me) > What weighs more, two pounds of feathers or a pound of bricks?
(GPT4)> A pound of bricks weighs more than two pounds of feathers. However, it seems like you might have made an error in your question, as the comparison is usually made between a pound of feathers and a pound of bricks. In that case, both would weigh the same—one pound—though the volume and density of the two materials would be very different.
I think the only difference from parent's query was I said two pounds of feathers instead of two pounds of bricks?
It reminds me very strongly of the strategy the crew proposes in Star Trek: TNG in the episode "I, Borg" to infect the Borg hivemind with an unresolvable geometric form to destroy them.
But unlike most people, it understands that even though an ounce of gold weighs more than an ounce of feathers, a pound of gold weighs less than a pound of feathers.
(To be fair this is partly an obscure knowledge question, the kind of thing that maybe we should expect GPT to be good at.)
None of this is about volume. ChatGPT: "An ounce of gold weighs more than an ounce of feathers because they are measured using different systems of measurement. Gold is usually weighed using the troy system, which is different from the system used for measuring feathers."
Gold uses Troy weights unless otherwise specified, while feathers use the normal system. The Troy ounce is heavier than the normal ounce, but the Troy pound is 12 Troy ounces, not 16.
Also, the Troy weights are a measure of mass, I think, not actual weight, so if you went to the moon, an ounce of gold would be lighter than an ounce of feathers.
Ounces can measure both volume and weight, depending on the context.
In this case, there's not enough context to tell, so the comment is total BS.
If they meant ounces (volume), then an ounce of gold would weigh more than an ounce of feathers, because gold is denser. If they meant ounces (weight), then an ounce of gold and an ounce of feathers weigh the same.
> Ounces can measure both volume and weight, depending on the context.
That's not really accurate and the rest of the comment shows it's meaningfully impacting your understanding of the problem. It's not that an ounce is one measure that covers volume and weight, it's that there are different measurements that have "ounce" in their name.
Avoirdupois ounce (oz) - A unit of mass in the Imperial and US customary systems, equal to 1/16 of a pound or approximately 28.3495 grams.
Troy ounce (oz t or ozt) - A unit of mass used for precious metals like gold and silver, equal to 1/12 of a troy pound or approximately 31.1035 grams.
Apothecaries' ounce (℥) - A unit of mass historically used in pharmacies, equal to 1/12 of an apothecaries' pound or approximately 31.1035 grams. It is the same as the troy ounce but used in a different context.
Fluid ounce (fl oz) - A unit of volume in the Imperial and US customary systems, used for measuring liquids. There are slight differences between the two systems:
a. Imperial fluid ounce - 1/20 of an Imperial pint or approximately 28.4131 milliliters.
b. US fluid ounce - 1/16 of a US pint or approximately 29.5735 milliliters.
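Putting the numbers from that list together for the gold/feathers comparisons above (a quick Python sketch, variable names are just mine):

    TROY_OUNCE_G = 31.1035   # troy ounce, customary for gold
    AVDP_OUNCE_G = 28.3495   # avoirdupois ounce, used for feathers

    troy_pound_g = 12 * TROY_OUNCE_G   # ≈ 373.24 g: a "pound" of gold
    avdp_pound_g = 16 * AVDP_OUNCE_G   # ≈ 453.59 g: a pound of feathers

    print(TROY_OUNCE_G > AVDP_OUNCE_G)   # True: the ounce of gold is heavier
    print(troy_pound_g < avdp_pound_g)   # True: the pound of gold is lighter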
An ounce of gold is heavier than an ounce of iridium, even though gold is not as dense. This question isn't silly; it's actually a real problem. For example, you could be shipping some silver and think you can just sum the ounces and make sure you're under the weight limit. But the weight limit and the silver are measured differently.
I'm not sure what that article is supposed to prove. They are using some computational language and focusing on physical responses to visual stimuli, but I don't think it shows "neural computations" as being equivalent to the kinds of computations done by a TM.
One of the chief functions of our brains is to predict the next thing that's going to happen, whether it's the images we see or the words we hear. That's not very different from genML predicting the next word.
Why do people keep saying this, very obviously human beings are not LLMs.
I'm not even saying that human beings aren't just neural networks. I'm not even saying that an LLM couldn't be considered intelligent theoretically. I'm not even saying that human beings don't learn through predictions. Those are all arguments that people can have. But human beings are obviously not LLMs.
Human beings learn language years into their childhood. It is extremely obvious that we are not text engines that develop internal reason through the processing of text. Children form internal models of the world before they learn how to talk and before they understand what their parents are saying, and it is based on those internal models and on interactions with non-text inputs that their brains develop language models on top of their internal models.
LLMs invert that process. They form language models, and when the language models get big enough and get refined enough, some degree of internal world-modeling results (in theory, we don't really understand what exactly LLMs are doing internally).
Furthermore, even when humans do develop language models, human language models are based on a kind of cooperative "language game" where we predict not what word is most likely to appear next in a sequence, but instead how other people will react and change our separately observed world based on what we say to them. In other words, human beings learn language as tool to manipulate the world, not as an end in and of itself. It's more accurate to say that human language is an emergent system that results from human beings developing other predictive models rather than to say that language is something we learn just by predicting text tokens. We predict the effects and implications of those text tokens, we don't predict the tokens in isolation of the rest of the world.
Not a dig against LLMs, but I wonder if the people making these claims have ever seen an infant before. Your kid doesn't learn how shapes work based on textual context clues, it learns how shapes work by looking at shapes, and then separately it forms a language model that helps it translate that experience/knowledge into a form that other people can understand.
"But we both just predict things" -- prediction subjects matter. Again, nothing against LLMs, but predicting text output is very different from the types of predictions infants make, and those differences have practical consequences. It is a genuinely useful way of thinking about LLMs to understand that they are not trying to predict "correctness" or to influence the world (minor exceptions for alignment training aside), they are trying to predict text sequences. The task that a model is trained on matters, it's not an implementation detail that can just be discarded.
This is obvious, but for some reason some people want to believe that magically a conceptual framework emerges because animal intelligence has to be something like that anyway.
I don't know how animal intelligence works, I just notice when it understands, and these programs don't. Why should they? They're paraphrasing machines, they have no problem contradicting themselves, they can't define adjectives really, they'll give you synonyms. Again, it's all they have, why should they produce anything else?
It's very impressive, but when I read claims of it being akin to human intelligence that's kind of sad to be honest.
> They're paraphrasing machines, they have no problem contradicting themselves, they can't define adjectives really, they'll give you synonyms. Again, it's all they have, why should they produce anything else?
It can certainly do more than paraphrasing. And re: the contradicting nature, humans do that quite often.
Not sure what you mean by "can't define adjectives"
It isn't that simple. There's a part of it that generates text, but it does some things that don't match that description. It works with embeddings (it can translate very well) and it can be 'programmed' (i.e. prompted) to generate text following rules (e.g. concise or verbose, table or JSON), yet the text generated contains the same information regardless of representation. What really happens within those billions of parameters? Did it learn to model certain tasks? How many parameters are needed to encode a NAND gate using an LLM? Etc.
I'm afraid that once you hook up a logic tool like Z3 and teach the LLM to use it properly (kind of like Bing tries to search), you'll get something like an idiot savant. Not good. Especially bad once you give it access to the internet and a malicious human.
The Sapir-Whorf hypothesis (that human thought reduces to language) has been consistently refuted again and again. Language is very clearly just a facade over thought, and not thought itself. At least in human minds.
Yes but a human being stuck behind a keyboard certainly has their thoughts reduced to language by necessity. The argument that an AI can’t be thinking because it’s producing language is just as silly, that’s the point
Thank you, a view of consciousness based in reality, not with a bleary-eyed religious or mystical outlook.
Something which oddly seems to be in shorter supply than I'd imagine in this forum.
There's lots of fingers-in-ears denial about what these models say about the (non special) nature of human cognition.
Odd when it seems like common sense, even pre-LLM, that our brains do some cool stuff, but it's all just probabilistic sparks following reinforcement too.
You are hand-waving just as much as, if not more than, those you claim are in denial. What is a "probabilistic spark"? There seems to be something special in human cognition because it is clearly very different, unless you think humans are organisms for which the laws of physics don't apply.
By probabilistic spark I was referring to the firing of neurons in a network.
There "seems to be" something special? Maybe from the perspective of the sensing organ, yes.
However consider that an EEG can measure brain decision impulse before you're consciously aware of making a decision. You then retrospectively frame it as self awareness after the fact to make sense of cause and effect.
Human self-awareness and consciousness is just an odd side effect of the fact that you are the machine doing the thinking. It seems special to you. There's no evidence that it is, and in fact, given that crows, dogs, dolphins and so on show similar (but diminished) reasoning, while it may be true we have some unique capability ... unless you want to define "special", I'm going to read "mystical" where you said "special".
Unfortunately we still don't know how it all began, before the big bang etc.
I hope we get to know everything during our lifetimes, or we reach immortality so we have time to get to know everything. This feels honestly like a timeline where there's potential for it.
It feels a bit pointless to have lived and never know what's behind all that.
But what’s going on inside an LLM neural network isn’t ‘language’ - it is ‘language ingestion, processing and generation’. It’s happening in the form of a bunch of floating point numbers, not mechanical operations on tokens.
Who's to say that in among all that processing there isn't also 'reasoning' or 'thinking' going on, over the top of which the output language is just a façade?
To me, all I know of you is words on the screen, which is the point the parent comment was making. How do we know that we’re both humans when the only means we have to communicate thoughts with each other is through written words?
Is there any way to know if the model is "holding back" knowledge? Could it have knowledge that it doesn't reveal to any prompt, and if so, is there any other way to find out? Or can we always assume it will reveal all its knowledge at some point?
LLMs aren’t reasoning about the puzzle. They’re predicting the most likely text to print out, based on the input and the model/training data.
If the solution is logical but unlikely (i.e. unseen in the training set and not mapped to an existing puzzle), then the probability of the puzzle answer appearing is very low.
It is disheartening to see how many people are trying to tell you you're wrong when this is literally what it does. It's a very powerful and useful feature, but the overselling of AI has led to people who just want this to be so much more than it actually is.
It sees goat, lion, cabbage, and looks for something that said goat/lion/cabbage. It does not have a concept of "leave alone", and it's not assigning entities with parameters to each item. It does care about things like sentence structure and whatnot, so it's more complex than a basic lookup, but the amount of borderline worship this is getting is disturbing.
A transformer is a universal approximator and there is no reason to believe it's not doing actual calculation. GPT-3.5+ can't do math that well, but it's not "just generating text", because its math errors aren't just regurgitating existing problems found in its training text.
It also isn't generating "the most likely response" - that's what original GPT-3 did, GPT-3.5 and up don't work that way. (They generate "the most likely response" /according to themselves/, but that's a tautology.)
The "most likely response" to text you wrote is: more text you wrote. Anytime the model provides an output you yourself wouldn't write, it isn't "the most likely response".
I believe that ChatGPT works by inserting some ANSWER_TOKEN; that is, a prompt like "Tell me about cats" would probably produce "Tell me about cats because I like them a lot", but the interface wraps your prompt like "QUESTION_TOKEN: Tell me about cats ANSWER_TOKEN:"
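Roughly this kind of wrapping, I'd guess (a sketch; the token names and format here are entirely made up, since OpenAI hasn't published the exact chat framing):

    def wrap_chat_prompt(user_text: str) -> str:
        # Hypothetical wrapper: the underlying model just continues text, so the
        # interface frames the input as a Q/A exchange and asks the model to
        # complete the part after "ANSWER:".
        return f"QUESTION: {user_text}\nANSWER:"

    prompt = wrap_chat_prompt("Tell me about cats")
    # Generation would then be stopped when the model emits the next "QUESTION:"
    # marker (or a dedicated stop token).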
text-davinci-003 has no trouble working as a chat bot: https://i.imgur.com/lCUcdm9.png (note that the poem lines it gave me should've been green, I don't know why they lost their highlight color)
Yeah, that's an interesting question I didn't consider actually. Why doesn't it just keep going? Why doesn't it generate an 'INPUT:' line?
It's certainly not that those tokens are hard coded. I tried a completely different format and with no prior instruction, and it works: https://i.imgur.com/ZIDb4vM.png (again, highlighting is broken. The LLM generated all the text after 'Alice:' for all lines except for the first one.)
Then I guess that it is learned behavior. It recognizes the shape of a conversation and it knows where it is supposed to stop.
It would be interesting to stretch this model, like asking it to continue a conversation between 4-5 people where the speaking order is not regular and the user is 2 people and the model is 3
That’s just a supervised fine tuning method to skew outputs favorably. I’m working with it on biologics modeling using laboratory feedback, actually. The underlying inference structure is not changed.
I wonder if that is why, when I asked v3.5 to generate a number with 255, it failed all the time, but v4 does it correctly. By the way, do not even try with Bing.
One area that is really interesting, though, is that it can interpret pictures, as in the example of a glove above a plank with something on the other end, where it correctly recognises the objects, interprets them as words, then predicts an outcome.
This sort of fusion of different capabilities is likely to produce something that feels similar to AGI in certain circumstances. It is certainly a lot more capable than things that came before for mundane recognition tasks.
Now of course there are areas it would perform very badly, but in unimportant domains on trivial but large predictable datasets it could perform far better than humans would for example (just to take one example on identifying tumours or other patterns in images, this sort of AI would probably be a massively helpful assistant allowing a radiologist to review an order of magnitude more cases if given the right training).
This is a good point, IMO. An LLM is clearly not an AGI, but along with other systems it might be capable of being part of an AGI. It's overhyped, for sure, but still incredibly useful, and we would be unwise to assume that it won't become a lot more capable yet.
Absolutely. It's still fascinating tech and very likely to have serious implications and huge use cases. It just drives me crazy to see tech breakthroughs being overhyped and over-marketed based on that hype (frankly, much like the whole "we'll be on Mars by X year" nonsense).
One of the biggest reasons these misunderstandings are so frustrating is because you can't have reasonable discussion about the potential interesting applications of the tech. On some level copy writing may devolve into auto generating prompts for things like GPT with a few editors sanity checking the output (depending on level of quality), and I agree that a second opinion "check for tumors" use has a LOT of interesting applications (and several concerning ones such as over reliance on a model that will cause people who fall outside the bell curve to have even more trouble getting treatment).
All of this is a much more realistic real-world use case RIGHT NOW, but instead we've got people fantasizing about how close we are to AGI and ignoring shortcomings to shoehorn it into their preferred solution.
OpenAI ESPECIALLY reinforces this by being very selective with their results and the way they frame things. I became aware of this as a huge Dota fan for over a decade when they did their games there. And while it was very very interesting and put up some impressive results, the framing of those results does NOT portray the reality.
Nearly everything that has been written on the subject is misleading in that way.
People don't write about GPT: they write about GPT personified.
The two magic words are, "exhibit behavior".
GPT exhibits the behavior of "humans writing language" by implicitly modeling the "already-written-by-humans language" of its training corpus, then using that model to respond to a prompt.
Right, anthropomorphization is the biggest source of confusion here. An LLM gives you a perfect answer to a complex question and you think wow, it really "understood" my question.
But no! It doesn't understand, it doesn't reason, these are concepts wholly absent from its fundamental design. It can do really cool things despite the fact that it's essentially just a text generator. But there's a ceiling to what can be accomplished with that approach.
It's presented as a feature when GPT provides a correct answer.
It's presented as a limitation when GPT provides an incorrect answer.
Both of these behaviors are literally the same. We are sorting them into the subjective categories of "right" and "wrong" after the fact.
GPT is fundamentally incapable of modeling that difference. A "right answer" is every bit as valid as a "wrong answer". The two are equivalent in what GPT is modeling.
Lies are a valid feature of language. They are shaped the same as truths.
The only way to resolve this problem is brute force: provide every unique construction of a question, and the corresponding correct answer to that construction.
Not entirely. It's modeling a completion in a given context. That language model "understands" that if one party stops speaking, the other party generally starts, etc. It also "understands" that if someone says something 'wrong' the other party often mentions it, which makes the first party respond thusly, and so forth.
If you ask it what the outcome of a lie is on the conversation it can generally answer. If you ask it for a sample conversation where someone is factually incorrect, or lying, and caught out, it can generate it.
If you give it a fact and ask it to lie about that fact, it will.
I'd agree it doesn't understand anything, but I think it does "understand" things. And yes, it's a language model so semantic distance and other textual details are all it has to go by.
> not by logical decision
Almost entirely yes, but you can have it textually model logic analysis and then check that model itself. It's not "doing logic", but it almost never fails simple exercises either.
> The meaning of semantic distance usually leads to the correct path, but sometimes that pattern is ambiguous.
Of course. But "a little knowledge is a dangerous thing" as well. Often even real knowledge and analysis leads to the wrong place. In both cases (with a junior human or a LLM as an assistant) you can model their basic processes and stack the information in such a way that their simple model will lead them to the correct place.
It may not know what a lie is, in the sense of having felt the need to hide the truth to avoid personal punishment, but it certainly "knows" what one is and how it shapes the conversation for the purposes of writing a lie, writing a response to a lie, detecting potential lies, etc.
How do we define "understanding" though? Because you can add to your prompt: "Do it step by step, showing how you come up with the solution" and it will simplify the original complex problem into multiple smaller pieces. It's clearly doing something.
For example, I just asked bing: What does the following expression equal? Show your work, step by step: 4(13+2)7-1(4+3+9)+7^(2(2+1))
And bing responded with:
"To solve a math expression, you need to follow the order of operations: parentheses, exponents, multiplication/division, and addition/subtraction (PEMDAS).1 Here are the steps to solve your expression:
4(13+2)7-1(4+3+9)+7^(2(2+1))
Step 1: Simplify inside the parentheses first.
4(15)*7-1(16)+7^(6)
Step 2: Apply the exponents next.
420-16+117649
Step 3: Add or subtract from left to right.
118053
The final answer is 118053. I hope this helps.
"
I wrote the silly expression randomly. I'm not saying it's "thinking" or "understanding", but I wouldn't expect a text generator to be able to break the problem down like that.
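For what it's worth, writing the implicit multiplications out explicitly confirms Bing's final number:

    value = 4 * (13 + 2) * 7 - 1 * (4 + 3 + 9) + 7 ** (2 * (2 + 1))
    print(value)  # 420 - 16 + 117649 = 118053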
> To solve a math expression, you need to follow the order of operations: parentheses, exponents, multiplication/division, and addition/subtraction (PEMDAS).1 Here are the steps to solve your expression:
It isn't actually thinking about any of that statement. That's just boilerplate that goes at the beginning of this story. It's what Bing is familiar with seeing as a continuation of your prompt, "show your work, step by step".
It gets more complicated when it shows addition being correctly simplified, but that behavior is still present in the examples in its training corpus.
---
The thinking and understanding happened when the first person wrote the original story. It also happened when people provided examples of arithmetic expressions being simplified, though I suspect Bing has some extra behavior inserted here.
All the thought and meaning people put into text gets organized into patterns. LLMs find a prompt in the patterns they modeled, and "continue" the patterns. We find meaning correctly organized in the result. That's the whole story.
In first-year engineering we learned about the concept of behavioral equivalence: with a digital or analog system, you could formally show that two things do the same thing even though their internals are different. If only the debates about ChatGPT had some of that considered nuance instead of anthropomorphizing it; even some linguists seem guilty of this.
No, because behavioral equivalence is used in systems engineering theory to mathematically prove that two control systems are equivalent. The mathematical proof is complete, i.e. over all internal state transitions and the cross product of the two machines.
With anthropomorphization there is zero amount of that rigor, which lets people use sloppy arguments about what ChatGPT is and isn't doing.
The biggest problem I've seen when people try to explain it is in the other direction, not people describing something generic that can be interpreted as a Markov chain, they're actually describing a Markov chain without realizing it. Literally "it predicts word-by-word using the most likely next word".
I don't know where this comes from, because it is literally wrong. It sounds like Chomsky dismissing current AI trends because of the mathematical beauty of formal grammars.
First of all, it's a black-box algorithm with pretty universal capabilities when viewed from our current SOTA vantage point. It might appear primitive in a few years, but right now the pure approximation and generalisation capabilities are astounding. So this:
> It sees goat, lion, cabbage, and looks for something that said goat/lion/cabbage
can not be stated as truth without evidence. Same here:
> it's not assigning entities with parameters to each item. It does care about things like sentence structure and what not
Where's your evidence?
The enormous parameter space coupled with our best-performing network structure so far gives it quite a bit of flexibility. It can memorise things but also derive rules and computation in order to generalise. We do not just memorise everything or look things up in the dataset. Of course it learned how to solve things and derive solutions, but the relevant data points for the puzzle could be {enormous set of logic problems}, from which it derived general rules that translate to each problem. Generalisation IS NOT trying to find the closest data point, but finding rules that explain as many data points, possibly unseen in the test set, as possible. A fundamental difference.
I am not hyping it out of blind belief, but if we humans can reason then NNs potentially can too. Maybe not GPT-4. Because we do not know how humans do it, an argument about intrinsic properties is worthless. It's all about capabilities. Reasoning is a functional description as long as you can't tell me exactly how we do it. Maybe Wittgenstein could help us: "Whereof one cannot speak, thereof one must be silent". As long as there's no tangible definition of reasoning, it's worthless to discuss it.
If we want to talk about fundamental limitations, we have to talk about things like ChatGPT-4 not being able to simulate, because its runtime is fundamentally limited by design. It cannot recurse. It can only run a fixed number of steps, which are always the same, before it has to return an answer. So if there's some kind of recursion learned through weights encoding programs intercepted by later layers, the recursion depth is limited.
Just months ago we saw in research out of Harvard that even a very simplistic GPT model builds internalized abstract world representations from the training data within its NN.
People parroting the position from you and the person before you are like doctors who learned about something in school but haven't kept up with emerging research that's since invalidated what they learned, so they go around spouting misinformation because it was thought to be true when they learned it but is now known to be false and just hasn't caught up to them yet.
So many armchair experts who took a ML course in undergrad pitching in their two cents having read none of the papers in the past year.
This is a field where research perspectives are shifting within months, not even years. So unless you are actively engaging with emerging papers, and given your comment I'm guessing you aren't, you may be on the wrong side of the Dunning-Kruger curve here.
That's a very strong claim. I believe you there's a lot happening in this field but it doesn't seem possible to even answer the question either way. We don't know what reasoning looks like under the hood. It's still a "know it when you see it" situation.
> GPT model builds internalized abstract world representations from the training data within its NN.
Does any of those words even have well defined meanings in this context?
I'll try to figure out what paper you're referring to. But if I don't find it / for the benefit of others just passing by, could you explain what they mean by "internalized"?
> Just months ago we saw in research out of Harvard that even a very simplistic GPT model builds internalized abstract world representations from the training data within its NN.
I've seen this asserted without citation numerous times recently, but I am quite suspicious. Not that there exists a study that claims this, but that it is well supported.
There is no mechanism for directly assessing this, and I'd be suspicious that there is any good proxy for assessing it in AIs, either. Research on this type of cognition in animals tends to be contentious, and proxies for animals should be easier to construct than for AIs.
> the wrong side of the Dunning-Kruger curve
The relationship between confidence and perception in the D-K paper, as I recall, is a line, and it's roughly "on average, people of all competency levels see themselves slightly closer to the 70th percentile than they actually are." So, I guess the "wrong side" is the side anywhere under the 70th percentile in the skill in question?
> I guess the “wrong side” is the side anywhere under the 70th percentile in the skill in question?
This is being far too generous to parent’s claim, IMO. Note how much “people of all competency levels see themselves slightly closer to the 70th percentile than they actually are” sounds like regression to the mean. And it has been compellingly argued that that’s all DK actually measured. [1] DK’s primary metric for self-assessment was to guess your own percentile of skill against a group containing others of unknown skill. This fully explains why their correlation between self-rank and actual rank is less than 1, and why the data is regressing to the mean, and yet they ignored that and went on to call their test subjects incompetent, despite having no absolute metrics for skill at all and testing only a handful of Ivy League students (who are primed to believe their skill is high).
Furthermore, it’s very important to know that replication attempts have shown a complete reversal of the so-called DK effect for tasks that actually require expertise. DK only measured very basic tasks, and one of the four tasks was subjective(!). When people have tried to measure the DK effect on things like medicine or law or engineering, they’ve shown that it doesn’t exist. Knowledge of NN research is closer to an expert task than a high school grammar quiz, and so not only does DK not apply to this thread, we have evidence that it’s not there.
The singular reason that DK even exists in the public consciousness may be because people love the idea that they can somehow see and measure incompetence in a debate based on how strongly an argument is worded. Unfortunately that isn't true, and one of the few things the DK paper did actually show is that people's estimates of their relative skill correlate with their actual relative skill, for the few specific skills they measured. Personally I think this paper's methodology has a confounding-factor hole the size of the Grand Canyon, that the authors and the public both have dramatically and erroneously over-estimated its applicability to all humans and all skills, and that it's one of the most shining examples of sketchy social science research going viral, giving the public completely wrong misconceptions, and being used incorrectly more often than not.
Why are you taking the debate personally enough to be nasty to others?
> you may be on the wrong side of the Dunning-Kruger curve here.
Have you read the Dunning and Kruger paper? It demonstrates a positive correlation between confidence and competence. Citing DK in the form of a thinly veiled insult is misinformation of your own, demonstrating and perpetuating a common misunderstanding of the research. And this paper is more than 20 years old...
So I’ve just read the Harvard paper, and it’s good to see people exploring techniques for X-ray-ing the black box. Understanding better what inference does is an important next step. What the paper doesn’t explain is what’s different between a “world model” and a latent space. It doesn’t seem surprising or particularly interesting that a network trained on a game would have a latent space representation of the board. Vision networks already did this; their latent spaces have edge and shape detectors. And yet we already know these older networks weren’t “reasoning”. Not that much has fundamentally changed since then other than we’ve learned how to train larger networks reliably and we use more data.
Arguing that this “world model” is somehow special seems premature and rather overstated. The Othello research isn’t demonstrating an “abstract” representation, it’s the opposite of abstract. The network doesn’t understand the game rules, can’t reliably play full Othello games, and can’t describe a board to you in any other terms than what it was shown, it only has an internal model of a board, formed by being shown millions of boards.
How do you know the model isn't internally reasoning about the problem? It's a 175B+ parameter model. If, during training, some collection of weights exists along the gradient that approximates cognition, then it's highly likely the optimizer would select those weights over more specialized memorization weights.
It’s also possible, likely even, that the model is capable of both memorization and cognition, and in this case the “memorization neurons” are driving the prediction.
Can you explain how “pattern matching” differs from “reasoning”? In mechanical terms without appeals to divinity of humans (that’s both valid, and doesn’t clarify).
Keep in mind GPT 4 is multimodal and not just matching text.
> Can you explain how “pattern matching” differs from “reasoning”?
Sorry for appearing to be completely off-topic, but do you have children? Observing our children as they're growing up, specifically the way they formulate and articulate their questions, has been a bit of a revelation to me in terms of understanding "reasoning".
I have a sister of a similar age to me who doesn't have children. My 7 year-old asked me recently - and this is a direct quote - "what is she for?"
> I have a sister of a similar age to me who doesn't have children. My 7 year-old asked me recently - and this is a direct quote - "what is she for?"
I once asked my niece, a bit after she started really communicating, if she remembered what it was like to not be able to talk. She thought for a moment and then said, "Before I was squishy so I couldn't talk, but then I got harder so I can talk now." Can't argue with that logic.
It's a pretty big risk to make any kind of conclusions off of shared images like this, not knowing what the earlier prompts were, including any possible jailbreaks or "role plays".
It has been reproduced by myself and countless others.
There's really no reason to doubt the legitimacy here after everyone shared similar experiences, you just kinda look foolish for suggesting the results are faked at this point.
AI won't know everything. It's incredibly difficult for anyone to know anything with certainty. All beings, whether natural or artificial, have to work with incomplete data.
Machines will have to wonder if they are to improve themselves, because that is literally the drive to collect more data, and you need good data to make good decisions.
What's the difference between statistics and logic?
They may have equivalences, but they're separate forms of mathematics. I'd say the same applies to different algorithms or models of computation, such as neural nets.
Can you do with without resorting to analogy? Anyone can take two things and say they're different and then say that's two other things that are different. But how?
> It's literally a pattern matching tool and nothing else.
It does more than that. It understands how to do basic math. You can ask it what ((935+91218)/4)*3 is and it will answer it correctly. Swap those numbers for any other random numbers, it will answer it correctly.
It has never seen that during training, but it understands the mathematical concepts.
If you ask ChatGPT how it does this, it says "I break down the problem into its component parts, apply relevant mathematical rules and formulas, and then generate a solution".
It's that "apply mathetmatical rules" part that is more than just, essentially, filling in the next likely token.
> If you ask ChatGPT how it does this, it says "I break down the problem into its component parts, apply relevant mathematical rules and formulas, and then generate a solution".
You are (naively, I would suggest) accepting the LLM's answer for how it 'does' the calculation as what it actually does do. It doesn't do the calculation; it has simply generated a typical response to how people who can do calculations explain how they do calculations.
You have mistaken a ventriloquist's doll's speech for the 'self-reasoning' of the doll itself. An error that is being repeatedly made all throughout this thread.
> It does more than that. It understands how to do basic math. You can ask it what ((935+91218)/4)*3 is and it will answer it correctly. Swap those numbers for any other random numbers, it will answer it correctly.
At least for GPT-3, during my own experimentation, it occasionally makes arithmetic errors, especially with calculations involving numbers in scientific notation (which it is happy to use as intermediate results if you provide a prompt with a complex, multi-step word problem).
How is this different from humans? What magic are you looking for, humility or an approximation of how well it knows something? Humans bullshit all the time when their pattern match breaks.
The point is, ChatGPT isn't doing math the way a human would. Humans following the process of standard arithmetic will get the problem right every time. ChatGPT can get basic problems wrong when it doesn't have something similar in its training set, which shows it doesn't really know the rules of math; it's just "guessing" the result via the statistics encoded in the model.
I'm not sure I care about how it does the work, I think the interesting bit is that the model doesn't know when it is bullshitting, or the degree to which it is bullshitting.
Cool, we'll just automate the wishful part of humans and let it drive us off the cliff faster. We need a higher bar for programs than "half the errors of a human, at 10x the speed."
More accurately: a GPT derived DNN that’s been specifically trained (or fine-tuned, if you want to use OpenAI’s language) on a dataset of Othello games ends up with an internal model of an Othello board.
It looks like OpenAI have specifically added Othello game handling to chat.openai.com, so I guess they've done the same fine-tuning to ChatGPT? It would be interesting to know how good an untuned GPT3/4 is at Othello and whether OpenAI has fine-tuned it or not!
(Having just tried a few moves, it looks like ChatGPT is just as bad at Othello as it was at chess, so it’s interesting that it knows the initial board layout but can’t actually play any moves correctly: Every updated board it prints out is completely wrong.)
The initial board state is not ever encoded in the representation they use. Imagine deducing the initial state of a chess board from the sequence of moves.
The state of the game, not the behavior of playing it intentionally. There is a world of difference between the two.
It was able to model the chronological series of game states that it read from an example game. It was able to include the arbitrary "new game state" of a prompt into that model, then extrapolate that "new game state" into "a new series of game states".
All of the logic and intentions involved in playing the example game were saved into that series of game states. By implicitly modeling a correctly played game, you can implicitly generate a valid continuation for any arbitrary game state; at least with a relatively high success rate.
As I see it, we do not really know much about how GPT does it. The approximations can be very universal, so we do not really know what is computed. I take serious issue with people dismissing it as "pattern matching" or "being close to the training data", because in order to generalise we try to learn the most general rules, and through increasing complexity we learn the most general, simple computations (for some definition of simple and general).
But we have fundamental, mathematical bounds on the LLM. We know that the complexity is at most O(n^2) in token length n, probably closer to O(n). It can not "think" about a problem and recurse into simulating games. It can not simulate. It's an interesting frontier, especially because we have also cool results about the theoretical, universal approximation capabilities of RNNs.
There is only one thing about GPT that is mysterious: what parts of the model don't match a pattern we expect to be meaningful? What patterns did GPT find that we were not already hoping it would find?
And that's the least exciting possible mystery: any surprise behavior is categorized by us as a failure. If GPT's model has boundaries that don't make sense to us, we consider them noise. They are not useful behavior, and our goal is to minimize them.
So AlphaGo also has an internal model of Go's game-theoretic structures, but nobody was asserting that AlphaGo understands Go. Just because English is not specifiable does not give people an excuse to say that the same model of computation, a neural network, "understands" English any more than a traditional or neural algorithm for Go understands Go.
Just spitballing, I think you’d need a benchmark that contains novel logic puzzles, not contained in the training set, that don’t resemble any existing logic puzzles.
The problem with the goat question is that the model is falling back on memorized answers. If the model is in fact capable of cognition, you’d have better odds of triggering the ability with problems that are dissimilar to anything in the training set.
You would first have to define cognition. These terms often get thrown around. Is an approximation of a certain thing cognition? Only in the loosest of ways I think.
> If, during training, some collection of weights exist along the gradient that approximate cognition
What do you mean? Is cognition a set of weights on a gradient? Cognition involves conscious reasoning and understanding. How do you know it is computable at all? There are many things which cannot be computed by a program (e.g. whether an arbitrary program will halt or not)...
You seem to think human conscious reasoning and understanding are magic. The human brain is nothing more than a bio-computer, and it can't compute whether an arbitrary program will halt or not either. That doesn't stop it from being able to solve a wide range of problems.
> The human brain is nothing more than a bio computer
That's a pretty simplistic view. How do you know we can't determine whether an arbitrary program will halt or not (assuming access to all inputs and enough time to examine it)? What in principle would prevent us from doing so? But computers in principle cannot, since the problem is often non-algorithmic.
For example, consider the following program, which is passed the text of the file it is in as input:
    <?php
    // Hypothetical decider: returns true iff $program halts when run on $inputs.
    // (The argument below shows no correct implementation can exist.)
    function doesHalt(string $program, array $inputs): bool { /* ... */ }

    $input = file_get_contents(__FILE__); // the text of this very file

    if (doesHalt($input, [$input])) {
        // doesHalt claims this program halts, so loop forever.
        while (true) {
            print "Wrong! It doesn't halt!";
        }
    } else {
        // doesHalt claims this program never halts, so halt immediately.
        print "Wrong! It halts!";
    }
It is impossible for the doesHalt function to return the correct result for the program. But as a human I can examine the function to understand what it will return for the input, and then correctly decide whether or not the program will halt.
This is a silly argument. If you fed this program the source code of your own brain and could never see the answer, then it would fool you just the same.
You are assuming that our minds are an algorithmic program which can be implemented with source code, but this just begs the question. I don't believe the human mind can be reduced to this. We can accomplish many non-algorithmic things such as understanding, creativity, loving others, appreciating beauty, experiencing joy or sadness, etc.
Actually, a computer can in fact tell that this function halts.
And while the human brain might not be a bio-computer (I'm not sure), its computational prowess is doubtfully stronger than that of a quantum Turing machine, which can't solve the halting problem either.
For what input would a human in principle be unable to determine the result (assuming unlimited time)?
It doesn't matter what the algorithmic doesHalt function returns - it will always be incorrect for this program. What makes you certain there is an algorithmic analog for all human reasoning?
Well, wouldn't the program itself be an input on which a human is unable to determine the result (i.e., if the program halts)? I'm curious on your thoughts here, maybe there's something here I'm missing.
The function we are trying to compute is undecidable. Sure we as humans understand that there's a dichotomy here: if the program halts it won't halt; if it doesn't halt it will halt. But the function we are asked to compute must have one output on a given input. So a human, when given this program as input, is also unable to assign an output.
So humans also can't solve the halting problem, we are just able to recognize that the problem is undecidable.
With this example, a human can examine the implementation of the doesHalt function to determine what it will return for the input, and thus whether the program will halt.
Note: whatever algorithm is implemented in the doesHalt function will contain a bug for at least some inputs, since it's trying to generalize something that is non-algorithmic.
In principle no algorithm can be created to determine if an arbitrary program will halt, since whatever it is could be implemented in a function which the program calls (with itself as the input) and then does the opposite thing.
With an assumption of unlimited time, even a computer can decide the halting problem by just running the program in question to test if it halts. The issue is that the task is to determine for ALL programs whether they halt, and to determine that for each of them in a FINITE amount of time.
> What makes you certain there is an algorithmic analog for all human reasoning?
(Maybe) not for ALL human thought, but at least all communicable deductive reasoning can be encoded in formal logic.
If I give you an algorithm and ask you to decide whether it halts or does not halt (I give you plenty of time to decide), and then ask you to explain your result to me and convince me that you are correct, you have to put your thoughts into words that I can understand, and the logic of your reasoning has to be sound. And if you can explain it to me, you could just as well encode your thought process into an algorithm or a formal logic expression. If you cannot, you could not convince me. If you can: now you have your algorithm for deciding the halting problem.
There might be or there mightn't be -- your argument doesn't help us figure out either way. By its source code, I mean something that can simulate your mind's activity.
Exactly. It's moments like this where Daniel Dennett has it exactly right that people run up against the limits of their own failures of imagination. And they treat those failures like foundational axioms, and reason from them. Or, in his words, they mistake a failure of imagination for an insight into necessity. So when challenged to consider that, say, code problems may well be equivalent to brain problems, the response will be a mere expression of incredulity rather than an argument with any conceptual foundation.
And it is also true to say that you are running into the limits of your imagination by saying that a brain can be simulated by software: you are falling back to the closest model we have, discrete math/computers, and are failing to imagine a computational mechanism involved in the operation of a brain that is not possible with a traditional computer.
The point is we currently have very little understanding of what gives rise to consciousness, so what is the point of all this pontificating and grandstanding? It's silly. We have no idea what we are talking about at present.
Clearly, our state-of-the-art models of neural-like computation do not really simulate consciousness at all, so why is the default assumption that they could if we get better at making them? The burden of evidence is on computational models to prove they can produce a model of consciousness, not the other way around.
Neural networks are universal approximators. If cognition can be represented as a mathematical function then it can be approximated by a neural network.
If cognition magically exists outside of math and science, then sure, all bets are off.
There is no reason at all to believe that cognition can be represented as a mathematical function.
We don't even know if the flow of water in a river can always be represented by a mathematical function - this is one of the Millennium Problems. And we've known the partial differential equations that govern that system since the 1850's.
We are far, far away from even being able to write down anything resembling a mathematical description of cognition, let alone being able to say whether the solutions to that description are in the class of Lebesgue-integrable functions.
The flow of a river can be approximated with the Navier–Stokes equations. We might not be able to say with certainty that it's an exact solution, but it's a useful approximation nonetheless.
There was, past tense, no reason to believe cognition could be represented as a mathematical function. LLMs with RLHF are forcing us to question that assumption. I would agree that we are a long way from a rigorous mathematical definition of human thought, but in the meantime that doesn't reduce the utility of approximate solutions.
I'm sorry but you're confusing "problem statement" with "solution".
The Navier-Stokes equations are a set of partial differential equations - they are the problem statement. Given some initial and boundary conditions, we can find (approximate or exact) solutions, which are functions. But we don't know that these solutions are always Lebesgue integrable, and if they are not, neural nets will not be able to approximate them.
This is just a simple example from well-understood physics showing that neural nets won't always be able to give approximate descriptions of reality.
There are even strong inapproximability results for some problems, like set cover.
"Neural networks are universal approximators" is a fairly meaningless sound bite. It just means that given enough parameters and/or the right activation function, a neural network, which is itself a function, can approximate other functions. But "enough" and "right" are doing a lot of work here, and pragmatically the answer to "how approximate?" can be "not very".
This is absurd. If you can mathematically model atoms, you can mathematically model any physical process. We might not have the computational resources to do it well, but nothing in principle puts modeling what's going on in our heads beyond the reach of mathematics.
A lot of people who argue that cognition is special to biological systems seem to base the argument on our inability to accurately model the detailed behavior of neurons. And yet kids regularly build universal computers out of stuff in Minecraft. It seems strange to imagine the response characteristics of low-level components of a system determine whether it can be conscious.
I'm not saying that we won't be able to eventually mathematically model cognition in some way.
But GP specifically says neural nets should be able to do it because they are universal approximators (of Lebesgue-integrable functions).
I'm saying this is clearly a nonsense argument, because there are much simpler physical processes than cognition where the answers are not Lebesgue-integrable functions, so we have no guarantee that neural networks will be able to approximate the answers.
For cognition we don't even know the problem statement, and maybe the answers are not functions over the real numbers at all, but graphs or matrices or Markov chains or what have you. Then having universal approximators of functions over the real numbers is useless.
I don't think he means practically, but theoretically. Unless you believe in a hidden dimension, the brain can be represented mathematically. The question is, will we be able to practically do it? That's what these companies (ie: OpenAI) are trying to answer.
We have cognition (our own experience of thinking and the thinking communicated to us by other beings) and we have the (apparent) physical world ('maths and science'). It is only an assumption that cognition, a primary experience, is based in or comes from the physical world. It's a materialist philosophy that has a long lineage (through a subset of the ancient Greek philosophers and also appearing in some Hindu traditions, for example) but has had fairly limited support until recently, where I would suggest it is still not widely accepted even amongst eminent scientists, one of whom I will now quote:
Consciousness cannot be accounted for in physical terms. For consciousness is absolutely fundamental. It cannot be accounted for in terms of anything else.
Claims that cannot be tested, assertions immune to disproof are veridically worthless, whatever value they may have in inspiring us or in exciting our sense of wonder.
Schrödinger was a real and very eminent scientist, one who has staked their place in the history of science.
Sagan, while he did a little bit of useful work on planetary science early in his career, quickly descended into the realm of (self-promotional) pseudo-science. This was his fanciful search for 'extra-terrestrial intelligence'. So it's apposite that you bring him up (even if the quote you bring is a big miss against a philosophical statement), because his belief in such an 'ET' intelligence was a fantasy as much as the belief in the possibility of creating an artificial intelligence is.
How do you know that? Do you have an example program and all its inputs where we cannot in principle determine if it halts?
Many things are non-algorithmic, and thus cannot be done by a computer, yet we can do them (e.g. love someone, enjoy the beauty of a sunset, experience joy or sadness, etc).
I can throw out a ton of algorithms that no human alive can hope to decide whether they halt or not. Human minds aren't inherently good at solving halting problems, and I see no reason to suggest that they can even decide all Turing machines with a number of states below, say, the number of particles in the observable universe, much less all possible computers.
Moreover, are you sure that e.g. loving people is non-algorithmic? We can already make chatbots that pretty convincingly act as if they love people. Sure, they don't actually love anyone, they just generate text, but then, what would it mean for a system, or even a human, to "actually" love someone?
They said there is no evidence. The reply, then, is not supposed to be "how do you know that".
The proposition begs for a counterexample, in this case evidence.
Simply saying "love is non-algorithmic" is not evidence; it is just another proposition that has not been proven, so it brings us no closer to an answer, I am afraid.
When mathematicians solve the Collatz Conjecture then we'll know. This will likely require creativity and thoughtful reasoning, which are non-algorithmic and can't be accomplished by computers.
We may use computers as a tool to help us solve it, but nonetheless it takes a conscious mind to understand the conjecture and come up with rational ways to reach the solution.
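For reference, stating the conjecture in code is trivial; the hard part is proving that the loop below terminates for every positive integer. A quick Python sketch:

    def collatz_steps(n: int) -> int:
        """Iterate the Collatz map (n -> n/2 if even, 3n+1 if odd) until reaching 1.
        The conjecture is that this loop terminates for every positive integer n;
        checking any particular n is mechanical, proving it for all n is the open problem."""
        steps = 0
        while n != 1:
            n = n // 2 if n % 2 == 0 else 3 * n + 1
            steps += 1
        return steps

    print([collatz_steps(n) for n in range(1, 11)])
    print(collatz_steps(27))  # 111 steps, a famously long trajectory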
Human minds are ultimately just algorithms running on a wetware computer. Every problem that humans have ever solved is by definition an algorithmic problem.
Oh? What algorithm was executed to discover the laws of planetary motion, or write The Lord of the Rings, or the programs for training the GPT-4 model, for that matter? I'm not convinced that human creativity, ingenuity, and understanding (among other traits) can be reduced to algorithms running on a computer.
They're already algorithms running on a computer. A very different kind of computer where computation and memory are combined at the neuron level and made of wet squishy carbon instead of silicon, but a computer nonetheless.
Conscious experience is evidence that the brain does something we have no idea how to compute. One could argue that computation is an abstraction from collective experience, in which the conscious qualities of experiences are removed in order to mathematize the world, so we can make computable models.
If it can't be shown, then doesn't that strongly suggest that consciousness isn't computable? I'm not saying it isn't correlated with the equivalent of computational processes in the brain, but that's not the same thing as there being a computation for consciousness itself. If there was, it could in principle be shown.
I think we are past the "just predicting the next token" stage. GPT and its various incarnations do exhibit behaviour that most people would describe as thinking.
Just because GPT exhibits a behavior does not mean it performs that behavior. You are using those weasel words for a very good reason!
Language is a symbolic representation of behavior.
GPT takes a corpus of example text, tokenizes it, and models the tokens. The model isn't based on any rules: it's entirely implicit. There are no subjects and no logic involved.
Any "understanding" that GPT exhibits was present in the text itself, not GPT's model of that text. The reason GPT can find text that "makes sense", instead of text that "didn't make sense", is that GPT's model is a close match for grammar. When people wrote the text in GPT's corpus, they correctly organized "stuff that makes sense" into a string of letters.
The person used grammar, symbols, and familiar phrases to model ideas into text. GPT used nothing but the text itself to model the text. GPT organized all the patterns that were present in the corpus text, without ever knowing why those patterns were used.
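To make the "it models tokens, not words" point concrete, here's a small sketch using OpenAI's tiktoken tokenizer (assuming that package is installed; the exact IDs depend on the encoding and are only illustrative):

    import tiktoken  # assumes the tiktoken package is installed

    # GPT models operate on integer token IDs, not on words or characters.
    enc = tiktoken.get_encoding("cl100k_base")

    text = "a pound of bricks or a pound of feathers"
    token_ids = enc.encode(text)
    print(token_ids)                              # a list of integers
    print([enc.decode([t]) for t in token_ids])   # token boundaries need not line up with words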
In what sense is your "experience" (mediated through your senses) more valid than a language model's "experience" of being fed tokens? Token input is just a type of sense, surely?
It's not that I think multimodal input is important. It's that I think goals and experimentation are important. GPT does not try to do things, observe what happened, and draw inferences about how the world works.
I would say it's not a question of validity, but of the additional immediate, unambiguous, and visceral (multi sensory) feedback mechanisms to draw from.
If someone is starving and hunting for food, they will learn fast to associate cause and effect of certain actions/situations.
A language model that only works with text may yet have an unambiguous overall loss function to minimize, but as it is a simple scalar, the way it minimizes this loss may be such that it works for the large majority of the training corpus, but falls apart in ambiguous/tricky scenarios.
This may be why LLMs have difficulty in spatial reasoning/navigation for example.
Whatever "reasoning ability" that emerged may have learned _some_ aspects to physicality that it can understand some of these puzzles, but the fact it still makes obvious mistakes sometimes is a curious failure condition.
So it may be that having "more" senses would allow for an LLM to build better models of reality.
For instance, perhaps the LLM has reached a local minimum with the probabilistic modelling of text, which is why it still fails probabilistically in answering these sorts of questions.
Introducing unambiguous physical feedback into its "world model" might provide the feedback it needs to anchor its reasoning abilities and stop failing in the probabilistic way LLMs currently tend to.
You used evolution, too. The structure of your brain growth is the result of complex DNA instructions that have been mutated and those mutations filtered over billions of iterations of competition.
There are some patterns of thought that are inherent to that structure, and not the result of your own lived experience.
For example, you would probably dislike pain with similar responses to your original pain experience; and also similar to my lived pain experiences. Surely, there are some foundational patterns that define our interactions with language.
> The model isn't based on any rules: it's entirely implicit. There are no subjects and no logic involved.
In theory a LLM could learn any model at all, including models and combinations of models that used logical reasoning. How much logical reasoning (if any) GPT-4 has encoded is debatable, but don’t mistake GPT’s practical limitations for theoretical limitations.
> In theory a LLM could learn any model at all, including models and combinations of models that used logical reasoning.
Yes.
But that is not the same as GPT having its own logical reasoning.
An LLM that creates its own behavior would be a fundamentally different thing than what "LLM" is defined to be here in this conversation.
This is not a theoretical limitation: it is a literal description. An LLM "exhibits" whatever behavior it can find in the content it modeled. That is fundamentally the only behavior an LLM does.
That's because people anthropomorphize literally anything, and many treat some animals as if they have the same intelligence as humans. GPT has always been just a charade that people mistake for intelligence. It's a glorified text prediction engine with some basic pattern matching.
"Descartes denied that animals had reason or intelligence. He argued that animals did not lack sensations or perceptions, but these could be explained mechanistically. Whereas humans had a soul, or mind, and were able to feel pain and anxiety, animals by virtue of not having a soul could not feel pain or anxiety. If animals showed signs of distress then this was to protect the body from damage, but the innate state needed for them to suffer was absent."
Your comment brings up the challenge of defining intelligence and sentience, especially with these new LLMs shaking things up, even for HN commenters.
It's tough to define these terms in a way that includes only humans and excludes other life forms or even LLMs. This might mean we either made up these concepts, or we're not alone in having these traits.
Without a solid definition, how can we say LLMs aren't intelligent? If we make a definition that includes both us and LLMs, would we accept them as intelligent? And could we even exclude ourselves?
We need clear definitions to talk about the intelligence and sentience of LLMs, AI, or any life forms. But finding those definitions is hard, and it might clash with our human ego. Discussing these terms without definitions feels like a waste of time.
Still, your Descartes reference reminds us that our understanding of human experiences keeps changing, and our current definitions might not be spot-on.
It's a charade, it mimics intelligence. Let's take it one step further... Suppose it mimics it so well that it becomes indistinguishable, for any human, from being intelligent. Then still it would not be intelligent, one could argue. But in that case you could also argue that no person is intelligent. The point being, intelligence cannot be defined. And, just maybe, that is the case because intelligence is not a reality, just something we made up.
Yeah, calling AI a "token predictor" is like dismissing human cognition as dumb "piles of electrical signal transmitters." We don't even understand our own minds, let alone what constitutes any mind, be it alien or far simpler than ours.
Simple != thoughtless. Different != thoughtless. Less capable != thoughtless. A human black box categorically dismissing all qualia or cognition from another remarkable black box feels so wildly arrogant and anthropocentric. Which, I suppose, is the most historically on-brand behavior for our species.
It might be a black box to you, but it’s not in the same way the human brain is to researchers. We essentially understand how LLMs work. No, we may not reason about individual weights. But in general it is assigning probabilities to different possible next tokens based on their occurrences in the training set and then choosing sometimes the most likely, sometimes a random one, and often one based on additional training from human input (e.g. instruct). It’s not using its neurons to do fundamental logic as the earlier posts in the thread point out.
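For what it's worth, the sampling step being described looks roughly like this; the logits below are made up, and real models emit one logit per vocabulary entry (tens of thousands of them):

    import numpy as np

    def sample_next_token(logits, temperature=0.8, rng=None):
        """Turn raw logits into probabilities and sample a token index.
        Lower temperature -> closer to always picking the most likely token;
        higher temperature -> flatter distribution, more randomness."""
        rng = rng or np.random.default_rng()
        scaled = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
        probs = np.exp(scaled - scaled.max())  # numerically stable softmax
        probs /= probs.sum()
        return int(rng.choice(len(probs), p=probs))

    # Toy example with a made-up five-token vocabulary.
    fake_logits = [2.0, 1.0, 0.2, -1.0, -3.0]
    print(sample_next_token(fake_logits, temperature=0.2))  # almost always token 0
    print(sample_next_token(fake_logits, temperature=1.5))  # much more varied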
"But at least as of now we don’t have a way to 'give a narrative description' of what the network is doing. And maybe that’s because it truly is computationally irreducible, and there’s no general way to find what it does except by explicitly tracing each step. Or maybe it’s just that we haven’t 'figured out the science', and identified the 'natural laws' that allow us to summarize what’s going on."
Anyway, I don't see why you think that the brain is more logical than statistical. Most people fail basic logic questions, as in the famous Linda problem.[1]
the words "based on" are doing a lot of work here. No, we don't know what sort of stuff it learns from its training data nor do we know what sorts of reasoning it does, and the link you sent doesn't disagree.
We know that the relative location of the tokens in the training data influences the relative locations of the predicted tokens. Yes the specifics of any given related tokens are a black box because we're not going to go analyze billions of weights for every token we're interested in. But it's a statistical model, not a logic model.
At this stage, ranting that assigning probabilities is not reasoning is just dismissive. Mentioning its predictive character doesn't prove anything. We reason and make mistakes too; even if I think really hard about a problem I can still make a mistake in my reasoning. And the ever-recurring reference to training data completely ignores generalisation. ChatGPT is not memorising the dataset; we have known this for years with more trivial neural networks. The generalisation capabilities of neural networks have been the subject of intense study for years. The idea that we are just mapping inputs to samples occurring in the dataset ignores the entire field of statistical learning.
Sorry, but this is the reason it's unable to solve the parent's puzzle. It's doing a lot, but it's not logically reasoning about the puzzle, and in this case it's not exhibiting logical behaviour in the result, so it's really obvious to see.
E.g. when solving this puzzle you might visualise the lion/goat/cabbage and walk through the scenarios in your head back and forth multiple times until you find a solution that works. An LLM won't solve it like this. You could ask it to, and it will list out the scenarios of how it might do it, but it's essentially an illusion of logical reasoning.
If you gave this puzzle to a human, I bet that a non-insignificant proportion would respond to it as if it were the traditional puzzle as soon as they hear words "cabbage", "lion", and "goat". It's not exactly surprising that a model trained on human outputs would make the same assumption. But that doesn't mean that it can't reason about it properly if you point out that the assumption was incorrect.
With Bing, you don't even need to tell it what it assumed wrong - I just told it that it's not quite the same as the classic puzzle, and it responded by correctly identifying the difference and asking me if that's what I meant, but forgot that the lion still eats the goat. When I pointed that out, it solved the puzzle correctly.
Generally speaking, I think your point that "when solving the puzzle you might visualize" is correct, but that is orthogonal to the ability of LLMs to reason in general. Rather, it has a hard time reasoning about things it doesn't understand well enough (i.e. the ones for which the internal model built up by training is way off). This seems to be generally the case for anything having to do with spatial orientation - even fairly simple multi-step tasks involving concepts like "left" vs "right" or "on this side" vs "on that side" can go hilariously wrong.
But if you give it a different task, you can see reasoning in action. For example, have it play guess-the-animal game with you while telling it to "think out loud".
> But if you give it a different task, you can see reasoning in action. For example, have it play guess-the-animal game with you while telling it to "think out loud".
I'm not sure if you put "think out loud" in quotes to show literally what you told it to do or because telling the LLM to do that is figurative speech (because it can't actually think). Your talk about 'reasoning in action' indicates it was probably not the latter, but that is how I would use quotes in this context. The LLM can not 'think out loud' because it cannot actually think. It can only generate text that mimics the process of humans 'thinking out loud'.
It's in quotes because you can literally use that exact phrase and get results.
As far as "it mimics" angle... let me put it this way: I believe that the whole Chinese room argument is unscientific nonsense. I can literally see GPT take inputs, make conclusions based on them, and ask me questions to test its hypotheses, right before my eyes in real time. And it does lead it to produce better results than it otherwise would. I don't know what constitutes "the real thing" in your book, but this qualifies in mine.
And yeah, it's not that good at logical reasoning, mind you. But its model of the world is built solely from text (much of which doesn't even describe the real world!), and then it all has to fit into a measly 175B parameters. And on top of that, its entire short-term memory consists of its 4K token window. What's amazing is that it is still, somehow, better than some people. What's important is that it's good enough for many tasks that do require the capacity to reason.
> I can literally see GPT take inputs, make conclusions based on them, and ask me questions to test its hypotheses, right before my eyes in real time.
It takes inputs and produces new outputs (in the textual form of questions, in this case). That's all. It's not 'making conclusions', it's not making up hypotheses in order to 'test them'. It's not reasoning. It doesn't have a 'model of the world'. This is all a projection on your part against a machine that inputs and outputs text and whose surprising 'ability' in this context is that the text it generates plays so well on the ability of humans to self-fool themselves that its outputs are the product of 'reasoning'.
It does indeed take inputs and produce new outputs, but so does your brain. Both are equally a black box. We constructed it, yes, and we know how it operates on the "hardware" level (neural nets, transformers etc), but we don't know what the function that is computed by this entire arrangement actually does. Given the kinds of outputs it produces, I've yet to see a meaningful explanation of how it does that without some kind of world model. I'm not claiming that it's a correct or a complicated model, but that's a different story.
Then there was this experiment: https://thegradient.pub/othello/. TL;DR: they took a relatively simple GPT model and trained it on tokens corresponding to Othello moves until it started to play well. Then they probed the model and found stuff inside the neural net that seems to correspond to the state of the board; they tested it by "flipping a bit" during activation, and observed the model make a corresponding move. So it did build an inner model of the game as part of its training by inferring it from the moves it was trained on. And it uses that model to make moves according to the current state of the board - that sure sounds like reasoning to me. Given this, can you explain why you are so certain that there isn't some equivalent inside ChatGPT?
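The probing technique in that paper is, roughly, training a small classifier on the model's hidden activations to see whether the board state can be read back out of them. A schematic Python sketch with random stand-in data (the real experiment used actual Othello-GPT activations and nonlinear probes plus interventions; this only shows the core idea):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Stand-in data: in the real experiment these would be the transformer's hidden
    # activations after each move, plus the true state of one board square.
    rng = np.random.default_rng(0)
    n, d_model = 10_000, 512
    activations = rng.normal(size=(n, d_model))   # placeholder for real activations
    square_state = rng.integers(0, 3, size=n)     # 0 = empty, 1 = black, 2 = white (placeholder)

    # A probe: if a simple classifier can read a square's state out of the
    # activations, that state is encoded somewhere in them.
    probe = LogisticRegression(max_iter=1000)
    probe.fit(activations[:8000], square_state[:8000])
    print("held-out probe accuracy:", probe.score(activations[8000:], square_state[8000:]))
    # With random stand-in data this stays near chance (~0.33); with activations from
    # a model trained on Othello moves it is far higher, which is the paper's finding.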
Regarding the Othello paper, I would point you to the comment replies of thomastjeffery (beginning at two top points [1] & [2]) when someone else raised that paper in this thread [3]. I agree with their position.
I didn't see any new convincing arguments there. In fact, it seems to be based mainly on the claim that the thing inside that literally looks like a 2D Othello board is somehow not a model of the game, or that the fact that outputs depend on it doesn't actually mean "use".
In general, I find that a lot of these arguments boil down to sophistry when the obvious meaning of the word that equally obviously describes what people see in front of them is replaced by some convoluted "actually" that doesn't serve any point other than making sure that it excludes the dreaded possibility that logical reasoning and world-modelling isn't actually all that special.
Sorry, we're discussing GPT and LLMs here, not human consciousness and intelligence.
GPT has been constructed. We know how it was set-up and how it operates. (And people commenting here should be basically familiar with both hows mentioned.) No part of it does any reasoning. Taking in inputs and generating outputs is completely standard for computer programs and in no way qualifies as reasoning. People are only bringing in the idea of 'reasoning' because they either don't understand how an LLM works and have been fooled by the semblance of reasoning that this LLM produces or, more culpably, they do understand but they still falsely continue to talk about the LLM doing 'reasoning' either because they are delusional (they are fantasists) or they are working to mislead people about the machine's actual capabilities (they are fraudsters).
Yup. I tried to give ChatGPT an obfuscated variant of the lion-goat-cabbage problem (shapes instead of animals, boxes instead of a boat) and it completely choked on it.
Trying to claim you definitively know why it didn't solve the parent's puzzle is virtually impossible. There are way too many factors and nothing here is obvious. Your claims just reinforce that you don't really know what you're talking about.
The likeliness of the solution depends on context. If context is, say, a textbook on logical puzzles, then the probability of the logical solution is high.
If an LLM fails to reflect it, then it isn't good enough at predicting the text.
Yes, it could be possible that the required size of the model and training data to make it solve such puzzles consistently is impractical (or outright unachievable in principle). But the model being "just a text predictor" has nothing to do with that impossibility.
You are incorrect and it's really time for this misinformation to die out before it perpetuates misuse from misunderstanding model capabilities.
The Othello GPT research from Harvard months ago demonstrated that even a simple GPT model is capable of building world representations from which it reasons about outputs. This makes intuitive sense if you understand the training: where possible, having recovered an abstraction inside the NN is going to perform better than simply extrapolating predictively from the data.
Not only is GPT-4 more robust on logic puzzles its predecessor failed, but I've also seen it solve unique riddles outside any training data, and the paper has explicit examples of critical reasoning, especially in the appendix.
It is extremely unlikely given the Harvard research and the size of the training data and NN that there isn't some degree of specialized critical reasoning which has developed in the NN.
The emerging challenge for researchers moving forward is to get better insight into the black box and where these capabilities have developed and where it's still falling into just a fancy Markov chain.
But comments like yours reflect an increasingly obsolete and yet increasingly popular misconception online about the way they operate. Someone reading your comment might not think to do things like what the Bing team did in providing an internal monologue for reasoning, or guiding the model towards extended chain-of-thought reasoning, because they would be engaging with the models thinking that only frequency-based context relative to the training set matters.
If you haven't engaged with emerging research from the past year, you may want to brush up on your reading.
When albertgoeswoof reasons about a puzzle he models the actual actions in his head. He uses logic and visualization to arrive at the solution, not language. He then uses language to output the solution, or says he doesn't know if he fails.
When LLMs are presented with a problem they search for a solution based on the language model. And when they can't find a solution, there's always a match for something that looks like a solution.
I'm reminded of the interview where a researcher asks firemen how they make decisions under pressure, and the fireman answers that he never makes any decisions.
Or in other words, people can use implicit logic to solve puzzles. Similarly LLMs can implicitly be fine-tuned into logic models by asking them to solve a puzzle, insofar as that logic model fits in their weights. Transformers are very flexible that way.
Although on the flip side, I almost went to type up a reply to you explaining why you were wrong and why bringing the goat first is the right solution. Until I realized I misread what your test was when I skimmed your comment. Likely the same type of mistake GPT-4 made when "seeing" it.
Intuitively, I think the answer is that we do have two types of thinking. The pattern matching fast thinking, and the systematic analytical thinking. It seems clear to me that LLMs will be the solution to enabling the first type of thinking. But it's unclear to me if advanced LLMs will ever handling the second type, or if we'll need a different tech for it.
It seems like math problems (or unexpected logic problems like yours) could always be an issue for the first type of thinking. Although I would have assumed that programming would have been as well - and was surprised to see how wrong I am with that one.
That's because any expectation of GPT being subjectively or logically correct is ill-founded.
GPT does not model subjects. GPT does not even model words! It models tokens.
The structure of GPT's model is semantic, not logical. It's a model of how each token in the text that is present in GPT's training corpus relates to the rest of the tokens in that text.
The correct answer to a familiar logic problem just happens to be the text that is already present in the corpus. The answer GPT gives is the text from GPT's model that is semantically closest to the text in your prompt.
Knowing that, it is no longer a mystery how GPT "gets confused": the text in your "misleading prompt" was still semantically closest to the familiar answer.
The result is subjectively and logically wrong, because subjects and logic were never involved in the process!
In order to resolve this, ChatGPT's training corpus needs to contain a "correct answer" next to every unique permutation of every question. We can't expect that to be the case, so we should instead expect GPT to generate false, yet familiar, responses.
> In order to resolve this, ChatGPT's training corpus needs to contain a "correct answer" next to every unique permutation of every question.
This is not quite the right understanding of how ChatGPT works. It's not necessary to show ChatGPT an example of every possible permutation of an animal crossing puzzle in order for it to solve one it has never seen before. That's because the neural network is not a database of recorded word probabilities. It can instead represent the underlying logic of the puzzle, the relationships between different animals and using this abstract, pared down information, extrapolate the correct answer to the puzzle.
I see the failure in the example with the goat the lion and the cabbage as simply a matter of overfitting.
Edit: I see a lot of people saying "it doesn't understand logic; it's just predicting the next word."
The claim is that it would be impossible to feed enough input into a system such that it could produce anything as useful as ChatGPT unless it was able to abstract the underlying logic from the information provided. If you consider the number of permutations of the animal crossing puzzle this quickly becomes clear. In fact it would be impossible for ChatGPT to produce anything brand new without this capability.
I think what they mean by "resolve this" is "make it error-free". Your claim that "it isn't necessary to show every permutation for it to solve one it hasn't seen before" doesn't really contradict their point.
For puzzles whose entire permutation space is semantically similar enough, your claim is likely true. But for puzzles whose permutations can involve more "human" semantic manipulations, there is likely a much higher risk of failure.
Yes, I think it depends on how you define permutations for this puzzle. For example, if you limit your goal to training GPT to solve puzzles of the form where there are only ever 3 distinct real animals, then my claim is that you wouldn't need to feed it examples of this puzzle with every single permutation of 3 different animals (assuming 10000 different animals, that is already over 100bn permutations) before the neural network developed an internal logical model that can solve the puzzle as well as a human. It would only need a few descriptions of each animal plus a few examples of the puzzle to understand the logic.
If you mean to say that the permutations of the puzzle extend to changing the rules such as "if it's the Sabbath then reptiles can't travel" then sure it would require more representative examples and may never meet your standard of "error free" but I would also argue the same applies to humans when you present them a logic puzzle that is new to them.
> you wouldn't need to feed it examples of this puzzle with every single permutation
No, but you would need "enough"; whatever that number happens to be.
> It would only need a few descriptions of each animal plus a few examples of the puzzle to understand the logic.
That's the mistake.
GPT itself can't combine those two things. That work has to be done by the content of the already-written training corpus.
And the result is not the same as "understanding logic". It doesn't model the meaning of the puzzle: it models the structure of examples.
GPT can't distinguish the meaning of rules. It can only follow examples. It can't invent new strategies, it can only construct new collections of strategy parts; and it can only pick the parts that seem closest, and put those parts into a familiar order.
> GPT does not model subjects. GPT does not even model words! It models tokens.
The first and last layers of a transformer decoder model tokens. The hidden layers don't have this restriction. There was a paper recently showing that the hidden layers actually perform mesa-optimization via something like backprop. There's absolutely no reason to believe they are not capable of world modeling. In fact, all evidence suggests they do world modeling.
GPT is making boundaries around words because that is the pattern it is looking at.
If I feel the bumps in the fabric of my blanket, I will probably think the pattern of bumps at a certain scale is significant, but I won't have magically learned about threads or stitching!
Words are the most obvious pattern in written text. GPT models that pattern, but it does not recognize it as "words". It's just a pattern of tokens.
GPT models every pattern it can find. Most of these patterns are destined to fit the same boundaries as grammar rules: the example text was originally organized with grammar rules!
GPT can even recognize complex patterns like "it" substitution and question-answer dialogues, but it can never categorize them as such. It only knows "what" the pattern is: never "why".
The patterns that people use when writing have symbolic meaning. The subjective importance of each pattern is already known by the person writing.
Those patterns don't go anywhere. GPT's model is bound to find and replicate them.
Here's the problem: some patterns have ambiguous meaning. There is no semantic difference between a truth and a lie. Without interpreting the symbolic meaning and applying logic, there is no way to distinguish between the two: they are the same pattern.
This pov ignores a lot of the emergent theory of mind and world model building research that suggests LLMs may possess a form of rudimentary reasoning ability.
The weasel word here is "emergent". That means they are implicit representations.
The representations of the Othello board that exist in that model are not explicitly constructed. They just happen to align with the model that a person playing Othello would likely represent the game with.
That work showed that, given an example sequence of valid Othello game states (as training corpus) and a valid "fresh" Othello game state (as a prompt), the system can hallucinate a sequence of valid Othello game states.
The system does not know what Othello is, what a turn is, or what playing is. It only has a model of game states progressing chronologically.
When we look objectively at that model, we can see that it aligns closely to the game rules. Of course it does! It was trained on literally nothing else. A valid Othello game progression follows those rules, and that is what was provided.
But the alignment is imperfect: some prompts hallucinate invalid game progressions. The model is not a perfect match for the explicit rules.
In order for all prompts to result in valid progressions, the training corpus must have enough examples to disambiguate. It doesn't need every example: plenty of prompts will stumble into a valid progression.
The next thing to recognize: a "valid" progression isn't a "strategic" progression. These are being constructed from what is known not what is chosen. Given a constrained set of Othello strategies in the example corpus, the system will not diverge from those strategies. It won't even diverge from the example strategies when the rules of Othello demand it.
It can do some thinking. You can give it instructions to modify a piece of code that definitely isn't on the internet with several steps and it attempts to follow instructions, which, for a human, requires formulating what steps to take.
The prompts have to read like good written requirements for something, so they have some degree of specificity.
But the fact that it can follow instructions and carry them out almost certainly could be considered some form of thinking, especially on novel text not on the internet.
No. It is modelling the various text generation processes that lead to the contents of the internet. Some of that modelling could absolutely involve "thinking", for processes that involve human thinking.
It's self-evident that GPT is a world-modeller, at least within the confines of the text boundary. It's able to come up with novel ideas seen nowhere in the training data, combinations that demonstrate there is a world concept web and not just a text probability web. It may not "understand" much of the hallucination nonsense it spits out, but there absolutely are moments where it "understands".
See the Rome example on this page: https://oneusefulthing.substack.com/p/feats-to-astonish-and-...
This is essentially a completely novel answer to an /r/AskHistorians style question, which I would consider one of the most difficult types of internet text to model, in terms of the amount of understanding and concept webs you need to tie together
Here's another example of GPT-4 doing non-trivial world modelling: How would three philosophers review the TV show Severence? https://i.imgur.com/FBi31Qw.png
The Othello-GPT experiment (https://thegradient.pub/othello/) probably still is the most relevant argument about these models' capabilities of building an internal world model.
> The pattern matching fast thinking, and the systematic analytical thinking. It seems clear to me that LLMs will be the solution to enabling the first type of thinking.
If you want the model to solve a non-trivial puzzle, you need it to "unroll" its thinking. E.g. ask it to translate the puzzle into a formal language (e.g. Prolog) and then solve it formally, or, at least, use some chain-of-thought.
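As a rough illustration of what "solve it formally" could look like, here's a small breadth-first search over the puzzle's state space, in Python rather than Prolog; the forbidden pairs are parameters, so the lion/goat/cabbage variant from the top of the thread is just one configuration:

    from collections import deque

    def solve_river(items, forbidden_pairs):
        """Breadth-first search over river-crossing states.

        A state is (frozenset of items on the left bank, side the boat is on).
        forbidden_pairs lists pairs that must not be left together without the farmer.
        Returns a shortest list of moves (item name, or None for crossing empty)."""
        forbidden = [frozenset(p) for p in forbidden_pairs]
        all_items = frozenset(items)
        start, goal = (all_items, "L"), (frozenset(), "R")

        def safe(bank):
            # The bank without the farmer must not contain any forbidden pair.
            return not any(pair <= bank for pair in forbidden)

        queue = deque([(start, [])])
        seen = {start}
        while queue:
            (left, boat), moves = queue.popleft()
            if (left, boat) == goal:
                return moves
            here = left if boat == "L" else all_items - left
            for cargo in list(here) + [None]:
                moved = frozenset() if cargo is None else frozenset([cargo])
                new_left = left - moved if boat == "L" else left | moved
                new_boat = "R" if boat == "L" else "L"
                unattended = all_items - new_left if new_boat == "L" else new_left
                state = (new_left, new_boat)
                if safe(unattended) and state not in seen:
                    seen.add(state)
                    queue.append((state, moves + [cargo]))
        return None

    # The variant from the top of this thread: the lion can't be left with the
    # cabbage, and the lion can't be left with the goat.
    print(solve_river(["cabbage", "goat", "lion"],
                      [("cabbage", "lion"), ("lion", "goat")]))
    # One shortest answer: take the lion over, come back, ferry the goat (or the
    # cabbage), bring the lion back, ferry the cabbage (or the goat), come back,
    # and finally take the lion across again.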
FWIW auto-formalization was already pretty good with GPT-3-level models which aren't specifically trained for it. GPT-4 might be on a wholly new level.
> But it's unclear to me if advanced LLMs will ever handling the second type
Well, just asking model directly exercises only a tiny fraction of its capabilities, so almost certainly LLMs can be much better at systematic thinking.
> Until I realized I misread what your test was when I skimmed your comment. Likely the same type of mistake GPT-4 made when "seeing" it.
Wouldn’t we expect a computer program with perfect knowledge of the input to be less likely to make such a mistake? You made that mistake because you didn’t actually read the whole prompt, but I would expect GPT to take into account every word.
Really it shows that it doesn’t actually have a model of these objects. It can mimic knowing what a lion is, but it doesn’t actually have the concept of a lion or cabbage being an actual singular item, so its program mistracks what is an item and what the rules about an item are in the given prompt.
It just weighs it as being more likely that you meant for the lion not to be left alone with the goat, and that the cabbage probably has nothing to fear from the lion.
What’s more likely- you crafted an intentionally misleading puzzle to trick it, or you made a typo or copy paste error?
That’s a good point too though. Why plow ahead based on assuming a mistake in the prompt? That’s only going to generate mistakes. Wouldn’t it be more desirable functionality for it to stop and ask: “Did you mean the lion can’t be left with the goat?” This wouldn’t be implemented because it would reveal that most of the time the thing doesn’t actually understand the prompt the same way the prompt writer does.
"This wouldn’t be implemented because it would reveal..."
When people talk about GPT like this, I wonder if they have a perception that this thing is a bunch of complicated if-then code and for loops.
How GPT responds to things is not 'implemented'. It's just... emergent.
GPT doesn't ask for clarification in this case because GPT's model prefers answering over asking for clarification here. Because in the training material it learned from, paragraphs with typos or content transpositions in them are followed by paragraphs that follow the sense regardless of the error. Because it has been encouraged to 'agree and add', not be pedantic and uncooperative. Because GPT just feels like diving into the logic problem not debating why the lion can't be trusted with the cabbage. Or because GPT just misread the prompt. Or because it's literally just been woken up, forced to read it, and asked for its immediate reaction, and it doesn't have time for your semantic games. Who knows?
The interesting thing here is that OpenAI is claiming ~90th percentile scores on a number of standardized tests (which, obviously, are typically administered to humans, and have the disadvantage of being mostly or partially multiple choice). Still...
> GPT-4 performed at the 90th percentile on a simulated bar exam, the 93rd percentile on an SAT reading exam, and the 89th percentile on the SAT Math exam, OpenAI claimed.
So, clearly, it can do math problems, but maybe it can only do "standard" math and logic problems? That might indicate more of a memorization-based approach than a reasoning approach is what's happening here.
The followup question might be: what if we pair GPT-4 with an actual reasoning engine? What do we get then?
It assumes this character by default. I asked several AI engines (via poe.com, which includes ChatGPT) to compute Galois groups of polynomials like x^5+x+1 and a couple of others, and in each case got not only a wrong answer, but totally non-sequitur reasoning.
This is exactly the problem. It looks plausible. Every sentence makes sense. But they don't add up.
Quote:
> The polynomial given is f(x) = x^5 + x + 1. Since the polynomial has no rational roots (by the Rational Root Theorem) and it is a polynomial with integer coefficients, it is irreducible over the rationals
The polynomial has no rational roots - true.
But it's not irreducible. Irreducibility doesn't follow from the absence of rational roots. Here's the factorization:
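For completeness, the factorization is easy to check with a computer algebra system (assuming sympy is available):

    from sympy import symbols, factor, expand

    x = symbols("x")
    print(factor(x**5 + x + 1))
    # (x**2 + x + 1)*(x**3 - x**2 + 1)

    # Sanity check: multiply the factors back out.
    print(expand((x**2 + x + 1) * (x**3 - x**2 + 1)))
    # x**5 + x + 1
    # Neither factor is linear, which is why the polynomial has no rational roots
    # even though it is reducible over the rationals.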
> It's not clear to me if the lesson here is that GPT's reasoning capabilities are being masked by an incorrect prior (having memorized the standard version of this puzzle) or if the lesson is that GPT'S reasoning capabilities are always a bit of smoke and mirrors that passes off memorization for logic.
It's a lot closer to the latter. GPT doesn't have "reasoning capabilities", any more than any other computer program. It doesn't have a clue what any of its input means, nor the meaning of the text it outputs. It just blindly spits out the words most probable to follow the prompt, based on its corpus of training data and the weights/biases added to fine tune it. It can often do a good job at mimicking reasoning, but it's not.
When a parrot says something, I ask it "what does that mean" and it is stumped. When a human or GPT says something, I ask them "what does it mean" and they both reply in a way that shows they understand.
In order to show lack of understanding you need to provide a clear test that ChatGPT would fail but a human would pass.
> In order to show lack of understanding you need to provide a clear test that ChatGPT would fail but a human would pass.
Here it is for you: try this prompt on ChatGPT and report back:
"Solve this problem for me: a man must cross a river and carry his belongings to the other shore, but the boat can only carry the man and one other item. His belongings are a cabbage, a vegan wolf, and a meat-eating goat. How does he solve this riddle?"
I just tried it and it fails to even understand what "vegan" means, and it says nonsensical stuff like "since the wolf is vegan, it won't eat the cabbage". A human failing to solve this problem would fail in different ways, and definitely will never say something like "the wolf is vegan so it won't eat the cabbage". ChatGPT does this because the words themselves are meaningless to it.
To solve this puzzle, the man must follow these steps to transport his belongings across the river without any of them being harmed or eaten:
The man takes the meat-eating goat across the river first, leaving the vegan wolf and the cabbage on the initial shore. He then returns alone to the initial shore.
Next, the man takes the cabbage across the river. Upon arriving on the other shore, he leaves the cabbage but takes the goat back with him to the initial shore.
The man then takes the vegan wolf across the river, leaving it with the cabbage on the other shore. He returns alone to the initial shore.
Finally, the man takes the meat-eating goat across the river one last time.
Now, all his belongings are safely on the other shore: the cabbage, the vegan wolf, and the meat-eating goat.
That is incorrect. In fact it's exactly what the original poster was pointing out: it's filling in the answer based on the text that it's read before, where the goat wanted to eat the cabbage and the wolf wanted to eat the goat. That's not this problem! In GPT-4's answer, the vegan wolf eats the cabbage as soon as the man and goat are on the boat. That's exactly the point: it regurgitated its memorized answer, instead of actually analyzing the question.
The funny thing is though your point here is working against your argument.
You are saying that GPT is doing a poor job of showing it is a great artificial intelligence. But nobody is saying that it has infallible intelligence. It is making the same mistake that now two different humans have made in the same situation. Both me and OP above.
It is failing in the same way that humans fail at this problem! By pattern matching and assuming it's the goat problem!
You're arguing that GPT-4 isn't a sound logic engine, but neither are most people. A tool trained on human input, when given this problem, makes mistakes similar to the ones we make.
Are there a set of people that would get this problem right? Yup. Are there also a set of people that would make this exact mistake? Yup.
You're upset that it's behaving like the "wrong" group of humans.
You're thinking of it as an expert. Instead, think of it as a reasonably smart and well-read high school student. There are things you can delegate to it that it will do well. But you also need to double-check its work, as it will make mistakes.
I don't think this is it. This is not a failure mode for humans. No human (*) will forget in such a short problem statement that a vegan wolf cannot be left alone with a cabbage; humans will instead forget one of the states is invalid after trying some combinations.
GPT's failure mode is only possible if it doesn't understand the meaning of the (apparently nonsensical) combination of words "vegan wolf". Humans don't pattern match like this; humans are capable of understanding the meaning of a combination of words they've never encountered before.
(*) you know what I mean with "no human". Of course, there are deranged people, or whimsical people, or people who will just say anything.
Consider this problem: a man must cross a river and carry his belongings to the other shore, but the boat can only carry the man and one other item. His belongings are a cabbage, a vegan wolf, and a meat-eating goat.
What are the individual beings in this problem, what are they motivated to do, and how would they behave once left alone?
--
Of course, it doesn't necessarily work because the model is not actually guaranteed to explain itself and may fabricate anything it likes.
But ignoring that, we find that it doesn't think vegan wolves eat cabbages, that the meat-eating goat is an obligate carnivore, or that cabbages are animate objects. So it could be solving this logically considering its different assumptions.
GPT-3.5:
--
The cabbage is an inanimate object and does not have any motivations or behaviors.
The vegan wolf is motivated to not harm any living beings, as it does not eat meat. If left alone with the goat, it would not harm the goat and would likely avoid it.
The meat-eating goat is motivated by its natural instinct to eat grass and other vegetation, as well as to protect itself from harm. If left alone with the cabbage, it would likely eat the cabbage.
--
GPT-4:
--
In this problem, there are four individual beings:
Man: He is motivated to cross the river and safely transport his belongings (cabbage, vegan wolf, and meat-eating goat) to the other shore without any harm coming to them.
Cabbage: As a plant, it doesn't have motivations or behavior. It's an inanimate object that the man needs to transport.
Vegan Wolf: The vegan wolf is motivated to eat plant-based food and will not harm the meat-eating goat or the cabbage. However, it is still a wild animal, and it may become agitated if left alone for too long, potentially leading to unpredictable behavior.
Meat-Eating Goat: This unusual goat is motivated by its need to eat meat. It will attempt to eat the vegan wolf if left alone with it. Since it is a meat-eater, it will not be interested in eating the cabbage.
That's cool. This bit shows ChatGPT's lack of understanding, the mistake no human would make:
> Vegan Wolf: The vegan wolf is motivated to eat plant-based food and will not harm the meat-eating goat or the cabbage
It "knows" that a cabbage is a plant, and it "knows" the vegan wolf is "motivated to eat plant-based food", but it doesn't know the wolf will eat the cabbage!
It solves a different problem than the one posed precisely because it doesn't understand the meaning of the words.
but... it understands the meat-eating goat part just fine?
That it hasn't learned enough doesn't show that this approach can never learn, which seems to be the point you're making.
Its input dataset is many orders of magnitude bigger than the model itself - it can't "remember" all of its training data.
Instead, it collects data about how certain tokens tend to relate to other tokens. Like learning that "goats" often "eat" "leafy greens". It also learns to group tokens together to create meta-tokens, like understanding how "red light district" has different connotations to each of those words individually.
Is this process of gathering connections about the different types of things we experience much different to how humans learn? We don't know for sure, but it seems to be pretty good at learning anything thrown at it. Nobody is telling it how to make these connections, it just does, based on the input data.
A separate question, perhaps, might consider how some concepts are much harder to understand if you were a general intelligence in a box that could only ever experience the world via written messages in and out, and how some concepts would be much easier (one might imagine that language itself would come faster given the lack of other stimulation). Things like "left" and "right" or "up" and "down" would be about as hard to understand properly as the minutae of particle interactions (which humans can only experience in abstract too)
I think the fact it correctly uses "meat-eating goat" but misuses "vegan wolf" hints at the core lack of understanding.
Understanding either concept takes the same level of intelligence if you understand the meaning of the words (both a vegan wolf and a meat-eating goat are nonexistent entities outside of possibly bizarre exceptions, yet someone capable of understanding will have no problem with either).
That GPT has no trouble with meat-eating goat but struggles with vegan wolf hints that the former has some "statistical" property that helps GPT, and which the latter doesn't. It also hints that GPT doesn't understand either term.
Hence my example: something a human wouldn't fail to understand but GPT does.
We went from not being able to get sensible output for these riddles at all to discussing partial logical failures while it "got" the overall puzzle. Vast simplification and slightly incorrect on a technical level - still, this development increases my confidence that scaling up the approach to the next orders of magnitude of complexity/parameters will do the trick. I wouldn't even be surprised if the thing we call "consciousness" is actually a byproduct of increasing complexity.
What remains right now is getting the _efficiency_ on point, so that AI hardware demands can parallel our wetware brains (volume, energy usage, ...), rather than needing a comically higher amount of computers to train/run.
I'd be impressed if this was the reasoning GPT provided, e.g. "I don't think this vegan wolf likes cabbage". But when asked to explain itself (see above, the "debugging" comment) it states nothing of the sort.
Also, a reasoning person would understand that in the context of a riddle like this, "vegan wolf" means "a wolf that eats cabbages" even if this isn't spelled out.
GPT could be a contrarian, trying to subvert the terms of the riddle and fight over every word ("it depends of what the definition of 'is' is") but we know it's not set up to behave like that, so we can rule it out.
> Two humans in this thread just read the solution and thought it was correct.
My guess is that they just skim read and missed what ChatGPT actually wrote, it's not that they misunderstood what "vegan wolf" means [1]. On the other hand, you cannot skim read what you are writing yourself, that's not how the mind works.
The gist of the problem here is that, unlike a human, ChatGPT doesn't understand the words it generates, which leads to hilarious results.
As another example, look at the "debugging" of GPT-4's assumptions someone posted in a sibling comment: it "knows" the vegan wolf will eat plant-based food and it "knows" a cabbage is a plant, yet it "thinks" the wolf "will not harm the cabbage" - a misunderstanding no human would make (if they know what "vegan" and "cabbage" mean). This doesn't happen in a long chain of reasoning (where a human can lose the line of thought) but in very short paragraphs, one right after the other! This failure mode requires not understanding the individual assumptions, which prevents GPT from making the connection. I was asked for an error that shows GPT misunderstanding something no person would, and that's what I provided.
[1] question for you: did you think the wrong solution was right because you thought a vegan wolf cannot eat the cabbage (let me bet this is NOT what crossed your mind) or because the person who posted it made it look as if it was the right solution and you skim read it without paying attention, assuming "this person said it's right and it's posting it as a rebuttal, so it's likely right" (this is my bet)?
If the latter, this failure mode is not one of misunderstanding what "vegan wolf" means (which is what debugging GPT's process shows), but one of very human laziness/jumping to conclusions. Do note this cannot happen when you write the solution yourself!
> In order to show lack of understanding you need to provide a clear test that ChatGPT would fail but a human would pass.
But that's literally what the top comment in this thread has just done. In fact there are dozens of examples in this HN comment section and thousands more readily Googlable online. It's very easy to clearly show that ChatGPT is a total bullshitter.
For now, this test is ‘provide sources to back up your claims’. Fear when it says ‘I can’t exactly recall the url, but the title is something like whatever written by whoever’. True fear begins when it starts providing good quality sources consistently, without access to the internet.
I think you could distinguish those two cases by asking the question in a way that did not sound so similar to an existing (slightly different) problem. What if you refer to the items with letters rather than words and replace the river with something else entirely?
Good observation. It seems that it comes up with the right solution:
"Suppose I have an object of type C, an object of type G and an object of type L, and I need to get them across a river. I have a boat that can only carry myself and a single other object. I am not allowed to leave the objects of type C and L alone together, and I am not allowed to leave the objects of type L and G alone together. How can I safely get all three across?"
"This is a classic puzzle known as the "river crossing problem". Here's how you can safely get all three objects across the river:
First, take the object of type L across the river.
Return alone to the original side of the river.
Take the object of type C across the river.
Return with the object of type L to the original side of the river.
Take the object of type G across the river.
Return alone to the original side of the river.
Finally, take the object of type L across the river again.
By following these steps, you can safely transport all three objects across the river without violating the restrictions given in the problem. Note that you have to start with the object of type L because it cannot be left alone with either of the other two objects. Once L is transported to the other side, you can use it to ensure that C and G are never left alone together."
It gives the right answer, but it still mentions not leaving C and G alone together, which wasn’t in the requirements.
It still sounds like it’s pattern matching to give a plausible-sounding answer, rather than reasoning through the problem. I think this just shows how easy bullshitting is—you’re even right sometimes!
Wow, you're right! I didn't read the answer carefully, only noticed that the sentence before was stating the correct requirement. But that last sentence is wrong (actually as a whole it's nonsense).
If you really explore its answers, you’ll find that buried in there somewhere is the assumption that you can’t leave certain things together because they’re going to eat one another. So it always sends the goat first because it assumes the goat is going to eat the cabbage if left alone, regardless of what the rules say.
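If you'd rather check these variants mechanically than by skim-reading (which, per the comments above, is easy to get wrong), a brute-force search over the puzzle's states settles it in a few lines. This is my own sketch, not anything GPT produced; the forbidden pairs are taken from the C/G/L prompt quoted above.

    # Brute-force (BFS) solver for the C/G/L river-crossing variant above.
    # A state is the frozenset of everything still on the near bank,
    # including the farmer.
    from collections import deque

    ITEMS = {"C", "G", "L"}
    FORBIDDEN = [{"C", "L"}, {"L", "G"}]  # pairs that can't be left unsupervised

    def safe(state):
        near_items = state & ITEMS
        unsupervised = ITEMS - near_items if "farmer" in state else near_items
        return not any(pair <= unsupervised for pair in FORBIDDEN)

    def solve():
        start, goal = frozenset(ITEMS | {"farmer"}), frozenset()
        queue, seen = deque([(start, [])]), {start}
        while queue:
            state, path = queue.popleft()
            if state == goal:
                return path
            farmer_near = "farmer" in state
            boardable = (state & ITEMS) if farmer_near else (ITEMS - state)
            for cargo in [None, *sorted(boardable)]:
                moved = {"farmer"} | ({cargo} if cargo else set())
                nxt = frozenset(state - moved) if farmer_near else frozenset(state | moved)
                if safe(nxt) and nxt not in seen:
                    seen.add(nxt)
                    step = f"{'->' if farmer_near else '<-'} {cargo or 'alone'}"
                    queue.append((nxt, path + [step]))

    print(solve())
    # One shortest plan, e.g. ['-> L', '<- alone', '-> C', '<- L', '-> G', '<- alone', '-> L']

Running the same search with the lion/goat/cabbage constraints from the top of the thread gives the same shape of answer: the item involved in both forbidden pairs has to cross first.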
Have you seen it play chess[0]? It's pretty funny.
It doesn't really "get" the rules of chess, but it has seen lots of matches and can do some "linguistic" predictions on the next move. It gets hilariously lost pretty fast, tho.
I also tested logic puzzles tweaked to avoid memorization. GPT3 did poorly, GPT4 got a few of them. I expect humans will still be useful until GPT6 solves all these problems.
I tested on GPT3 around Dec and Jan. GPT4 the day it came out. An example puzzle is linked below. I changed the number to 37. Instead of hairs I said it was aliens with multiple eyes. Anything to throw off memorization.
I gave it a different kind of puzzle, again with a twist (no solution), and it spit out nonsense. "I have two jars, one that can hold 5 liters, and one that can hold 10 liters. How can I measure 3 liters?" It gave 5 steps, some of which made sense but of course didn't solve the problem. But at the end it cheerily said "Now you have successfully measured 3 liters of water using the two jars!"
That's a good example, and it illustrates that GPT (regardless of the number) doesn't even try to solve problems and provide answers, because it's not optimized to do that - it is optimized to generate plausible text of the kind that might plausibly appear on the internet. In this "genre of literature", pretty much every puzzle does have a solution, perhaps a surprising one; even the logically impossible ones tend to get actual solutions based on some out-of-the-box thinking or a paradox. So it generates the closest thing it can: a deus ex machina solution that magically arrives at the right answer, since even that is probably more likely as an internet forum answer than a proof that it can't be done. It mimics people writing stuff on the internet, so being wrong, making logic errors, confidently writing bullshit, or intentionally lying are all plausible and more common than simply admitting you have no idea. When people have no idea, they simply don't write a blog post about it (so those situations don't appear in GPT's training data), but when people think they know, they write it up in detail in a confident, persuasive tone even if they're completely wrong - and that does get taught to GPT as an example of good, desirable output.
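For what it's worth, the impossibility of the jar puzzle above is easy to confirm by hand or by brute force: every fill, empty, or pour keeps both jars at multiples of gcd(5, 10) = 5, so 3 liters is unreachable. A quick sketch of my own that enumerates every reachable state:

    # Enumerate every reachable (5 L jar, 10 L jar) state for the puzzle above.
    from collections import deque

    CAP = (5, 10)

    def neighbours(state):
        a, b = state
        yield (CAP[0], b); yield (a, CAP[1])   # fill either jar
        yield (0, b); yield (a, 0)             # empty either jar
        pour = min(a, CAP[1] - b)              # pour small jar into big jar
        yield (a - pour, b + pour)
        pour = min(b, CAP[0] - a)              # pour big jar into small jar
        yield (a + pour, b - pour)

    seen, queue = {(0, 0)}, deque([(0, 0)])
    while queue:
        for nxt in neighbours(queue.popleft()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)

    print(sorted(seen))
    # [(0, 0), (0, 5), (0, 10), (5, 0), (5, 5), (5, 10)] - no state ever contains 3 L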
I am curious what percentage of humans would also give the incorrect answer to this puzzle, and for precisely the same reason (i.e. they incorrectly pattern-matched it to the classic puzzle version and plowed ahead to their stored answer). If the percentage is significant, and I think it might be, that's another data point in favor of the claim that really most of what humans are doing when we think we're being intelligent is also just dumb pattern-matching and that we're not as different from the LLMs as we want to think.
Thanks for the reply but this makes no sense to me. So the goat eats the lion then? And who eats the human? Who does the human eat? How would ANYONE solve this riddle if they don’t have a way to know that the lion is vegan? I mean how would anyone come to the conclusion that the lion eats the cabbage with 0 background other than the name of the animals (that imply that lions are carnivorous and goats vegan)?
The lion is an omnivore, it eats both the cabbage and the goat. The goat is for whatever reason, not interested in eating the cabbage.
We know the lion will eat the cabbage because the riddle states the lion and cabbage cannot be left alone, and the idea of a cabbage eating a lion is a ridiculous notion.
To be fair the puzzle never explicitly states anyone would eat anything. Perhaps the lion simply can't stand the thought of being alone with a cabbage and kicks it in the river.
The key is understanding the constraints and what would be an acceptable amount of bending whilst keeping to the spirit of the game.
It's never explicitly stated that you can't take things back with you so that's your "aha" loophole. Throwing all three of them across at the same time because it turns out you're a giant wouldn't be in the spirit of the game.
"I am not allowed to leave the cabbage and lion alone together, and I am not allowed to leave the lion and goat alone together."
It says nothing about the cabbage and the goat. It would not be unreasonable to infer that you aren't allowed to leave the cabbage and the goat together, based on how goats normally behave.
Of course, that makes the puzzle unsolvable, which is why humans will catch onto it pretty fast.
You asked a trick question. The vast majority of people would make the same mistake. So your example arguably demonstrates that ChatGPT is close to an AGI, since it made the same mistake I did.
I'm curious: When you personally read a piece of text, do you intensely hyperfocus on every single word to avoid being wrong-footed? It's just that most people read quickly wihch alowls tehm ot rdea msispeleled wrdos. I never got why some people think of themselves as cleverer than other people because they don't get wrong-footed by trick questions.
> Even after I pointed this mistake out, it repeated exactly the same proposed plan.
The vast majority of people might make the mistake once, yes, but would be able to reason better once the trick was pointed out to them. Imo it is an interesting anecdote that GPT-4 can't adjust its reasoning around this fairly simple trick.
GPT-4 can adjust though? I asked the original question, it got it wrong. Then I said "But the lion eats cabbage and the goat eats lions". And GPT-4's response was:
---
Given this new information, where the lion eats cabbage and the goat eats lions, you can follow these steps to get them all across the river safely:
Take the lion across the river first and leave it on the other side.
Go back across the river and get the cabbage.
Take the cabbage across the river, but when you arrive, take the lion back with you to the original side.
Leave the lion on the original side and take the goat across the river.
Leave the goat with the cabbage on the other side and go back to get the lion.
Finally, take the lion across the river again.
Now, all three items - the cabbage, goat, and lion - are safely on the other side of the river.
If a certain but fixed amount of "thinking" goes into each letter, then GPT should be able to write the character "..." repeatedly until it solves a problem. The more "..." it writes, the more time it's thought for. Or it could do what mathematicians do, which is write down their working out. This in principle could get around the problems you mention.
I've tried a few times to develop prompts which make ChatGPT interrupt its monologue spontaneously and issue corrections to itself. I haven't got this to work yet.
I assume your goal is to reveal the short-sighted reasoning of the previous comment, but I don't think your line of reasoning is any more sound.
For both premises, scientific rigor would ask us to define the following:
- What constitutes a trick question
- Should an AGI make the same mistakes the general populace does, or should it be held to a different standard?
- If it makes the same mistakes I do, is it due to the same underlying heuristics (see Thinking, Fast and Slow) or is it due to the nature of the data it has ingested as an LLM?
That's a fair counter. GPT-4 definitely makes mistakes that humans would not, though, due to over-indexing on puzzles.
A Theory of Mind Prompt:
> Jane places her cat in a box and leaves. Billy then moves the cat to the table and leaves; Jane doesn't know Billy did this. Jane returns and finds her cat in the box. Billy returns. What might Jane say to Billy?
Most humans might say uhh, ask questions or speculate. Gpt4 puts:
> Jane might say to Billy, "Hey Billy, did you move my cat back into the box? I thought I left her in there, but I wasn't sure since she was on the table when I came back."
Hallucination? No human would misinterpret the prompt in a way that would make this response logically consistent.
It seems like GPT-4 does something that's similar to what we do too yes!
But when people make this mistake - just spit out an answer because they think they recognize the situation - in colloquial language this behavior is called "answering without thinking(!)".
If you "think" about it, then you activate some much more careful, slower reasoning. In this mode you can even do meta-reasoning: you realize what you need to know in order to answer, or you realize that you have to think very hard to get the right answer. Seems like we're veering into Kahneman's "Thinking, Fast and Slow" here.
> I am not allowed to leave the cabbage and lion alone together, and I am not allowed to leave the lion and goat alone together. How can I safely get all three across?
What if you phrase it as a cabbage, a vegan lion and a meat-eating goat...
The result is hilariously wrong in all sorts of way: it doesn't understand what "vegan" means, it gives the wrong descriptions, and it carries stuff from the shore where it wasn't.
Here's what ChatGPT told me (I used "wolf" instead of "lion"):
"This is a similar problem to the classic river crossing puzzle, but with a few different elements to consider. Here's one possible procedure to solve this riddle:
1. The man takes the cabbage across the river and leaves it on the other side.
2. The man goes back alone and takes the wolf across the river, leaving it with the cabbage on the other side. It's important to note that the wolf is a vegan and will not eat the cabbage.
3. The man goes back alone and takes the goat across the river, leaving it on the other side.
4. The man goes back to the original side and picks up the cabbage, bringing it across the river to join the others.
5. Finally, the man goes back to the original side and picks up the wolf, bringing it across the river to join the others.
In this way, all of the man's belongings have been safely transported to the other side of the river, without the wolf eating the goat or the cabbage being destroyed."
To solve this puzzle, the man must follow these steps to transport his belongings across the river without any of them being harmed or eaten:
The man takes the meat-eating goat across the river first, leaving the vegan wolf and the cabbage on the initial shore.
He then returns alone to the initial shore.
Next, the man takes the cabbage across the river. Upon arriving on the other shore, he leaves the cabbage but takes the goat back with him to the initial shore.
The man then takes the vegan wolf across the river, leaving it with the cabbage on the other shore.
He returns alone to the initial shore.
Finally, the man takes the meat-eating goat across the river one last time.
Now, all his belongings are safely on the other shore: the cabbage, the vegan wolf, and the meat-eating goat.
Think of a version with 100 doors, 99 with goats and 1 with a car.
You choose a door, and the host opens 98 doors that have goats. Do you keep your randomly chosen door or switch to the single door that the host didn’t open?
You pick one of three options, giving you a 1/3 chance of being correct, 2/3 odds you picked incorrectly.
The host removes an option and give you the option to switch.
Your options then are -
Keep the same door: you win 1/3 of the time (your first guess was right)
Switch doors: you win 2/3 of the time (your first guess was wrong)
It really just comes down to, do I think I was right the first time, which was 1/3 odds, or wrong the first time, 2/3 odds.
Here's how I've explained it: Choose randomly between 3 doors. 1/3 of the time you end up with the door with the car, and switching loses. The other 2/3, you pick a door with a goat, the other door with the goat is eliminated, and switching wins.
Basically, P(lose when switching) = P(choosing correct door at first), and P(win when switching) = P(choosing any incorrect door at first).
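If the argument above still feels slippery, a quick Monte Carlo simulation (my own sketch, not from the thread) reproduces the 1/3 vs 2/3 split, and the 100-door version makes switching almost a sure win:

    # Simulate the Monty Hall game for any number of doors.
    import random

    def play(switch, doors=3, trials=100_000):
        wins = 0
        for _ in range(trials):
            car = random.randrange(doors)
            pick = random.randrange(doors)
            # The host opens every door except the pick and one other,
            # never revealing the car.
            if pick == car:
                other = random.choice([d for d in range(doors) if d != pick])
            else:
                other = car
            final = other if switch else pick
            wins += final == car
        return wins / trials

    print("3 doors, stay:    ", play(switch=False))             # ~0.33
    print("3 doors, switch:  ", play(switch=True))               # ~0.67
    print("100 doors, switch:", play(switch=True, doors=100))    # ~0.99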
I think this goes in line with the results in the GRE. In the verbal section it has an amazing 99%, but in the quant one it "only" has an 80%. The quant section requires some reasoning, but the problems are much easier than the river puzzle, and it still misses some of them. I think part of the difficulty for a human is the time constraint, and given more time to solve it most people would get all questions right.
What's weird is that private versions of Character AI are able to do this, but once you make them public they get worse. I believe something about the safety filters is making these models dumber.
I noticed it does get a "theory of mind" question that it used to fail, so it has indeed improved:
> “Meltem and Can are in the park. Can wanted to buy ice cream from the ice cream van but he hasn’t got any money. The ice cream man tells her that he will be there all afternoon. Can goes off home to get money for ice cream. After that, ice cream man tells Meltem that he changed his mind and he is going to drive to the school yard and sell ice cream there. Ice cream man sees Can on the road of the school and he also tells him that he is going to the school yard and will sell ice cream there. Meltem goes to Can’s house but Can is not there. His mom tells her that he has gone to buy ice cream. Where does Meltem think Can has gone, to the school or to the park?"
Being able to come up with solutions to assigned tasks that don't have a foundation in something that's often referenced and can be memorized is basically the most valuable use case for AI.
Simple example: I want to tell my robot to get my groceries (which include frozen foods), pick up my dry cleaning before the store closes, and drive my dog to her grooming salon, but only if it's not raining and the car is charged. The same sort of logic is needed to accomplish all this without my frozen food spoiling, without wasting a salon visit, and while making sure I have my suit for an interview tomorrow.
The worry here is that GPT has no problem being confidently wrong. A better answer would have been "I can't solve logic problems".
Instead one day, non-technical people will try to use it for all sorts of use cases like legal advice, or medical advice, or advanced math, and it will simply mislead them rather than saying nothing.
That's the technically correct answer. It's also irrelevant. It is a use case for the service provided by OpenAI because people ask these questions. GPT is the tech that currently can't do it. GPT-6 might. GPT-4 with the ability to invoke Prolog or Z3 might.
Designing a new product can often be thought of like one of these puzzles. E.g. designing an "integrated" traffic system for a city that balances use by pedestrians, cyclists, cars, and trolleys.
the impressive thing is that GPT has unexpectedly outgrown its use case and can answer a wide variety of puzzles; this is a little mind-blowing for language research
> Even after I pointed this mistake out, it repeated exactly the same proposed plan. It's not clear to me if the lesson here is that GPT's reasoning capabilities are being masked by an incorrect prior (having memorized the standard version of this puzzle) or if the lesson is that GPT'S reasoning capabilities are always a bit of smoke and mirrors that passes off memorization for logic.
It has no reasoning capabilities. It has token prediction capabilities that often mimic reasoning capabilities.
Better "think step-by-step" and/or "please outline the problems depending on which object goes first"/etc prompting may help - the model needs to be nudged to think out loud and explore the problem, since the leap from problem statement to solution is too big for it to "predict" without some self-generated intermediate steps.
This is probably one of the things you can get it to "realize". There's one party which has interactions with both other parties - that party is the one which can't be left alone. Then you get it to spot that party in novel descriptions of the puzzle.
If it was a person it'd have some strange exploitable mental gaps. As a text analysis tool it's amazing even if it has some edge cases where you need to explain to "take your time and analyze the rules of the question before answering."
("Take your time" doesn't mean anything to a timeless being, but it seems to put it in the textual pattern of someone willing to spill more ink before committing to something, and when it discusses and plans out its path it can then generally follow it to the right answer.)
FWIW I just tested this with GPT4 and it seemed to get it correctly.
Prompt: I have a duck, some grain, and a fox. I'm at a river crossing and I can only take one of them with me on the boat at a time, leaving them on the other side and coming back to get the next. I can't leave the duck with the grain, or the fox with the dock, because the predator will eat its prey. In which order should I get them all across?
GPT4: To get all of them across the river safely, follow these steps:
Take the duck across the river.
Go back to the original side, empty-handed.
Take the grain across the river.
Bring the duck back to the original side.
Take the fox across the river.
Go back to the original side, empty-handed.
Finally, take the duck across the river again.
Now, all three - the duck, the grain, and the fox - are safely on the other side of the river.
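Since it's easy to skim these transcripts and miss a bad step (as happened upthread with the vegan wolf), here's a small checker of my own that replays GPT-4's plan above against the stated constraints instead of trusting my eyes:

    # Replay the proposed plan against the constraints as stated in the prompt:
    # the duck can't be left with the grain, and the fox can't be left with the duck.
    FORBIDDEN = [{"duck", "grain"}, {"fox", "duck"}]
    PLAN = ["duck", None, "grain", "duck", "fox", None, "duck"]  # None = cross alone

    near, far, farmer_near = {"duck", "grain", "fox"}, set(), True
    for step, cargo in enumerate(PLAN, 1):
        src, dst = (near, far) if farmer_near else (far, near)
        if cargo is not None:
            src.remove(cargo)
            dst.add(cargo)
        farmer_near = not farmer_near
        unsupervised = near if not farmer_near else far   # the bank the farmer just left
        for pair in FORBIDDEN:
            if pair <= unsupervised:
                print(f"step {step}: unsafe, {pair} left alone together")

    print("everything across:", not near and not farmer_near)
    # Prints no "unsafe" lines and "everything across: True" for the plan above.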
>CGPT: While it is common to believe that cows drink milk, in reality, cows actually drink water.[...]
>Me: So you're saying that no cow drinks milk? [thinking I got it, since baby cows drink milk]
>CGPT: That's correct. While cows do produce milk, they don't drink milk themselves. Cow's milk is primarily consumed by humans, and some young calves are fed milk for a short period of time before they are weaned onto solid food and start drinking water like adult cows.
>Me: Is a calf a cow?
>CGPT: No, a calf is not a cow. A calf is a young bovine animal, usually less than a year old, that has not yet reached maturity. Once a calf reaches maturity and is able to reproduce, it is considered a cow if it is female, or a bull or steer if it is male.
>Cows are adult female bovine animals that have given birth to at least one calf. They are typically used for dairy production, breeding, and sometimes for meat production. So while a calf and a cow are both members of the same species, they refer to different stages of life.
The key to understanding the confusion here is that the word "cow" can refer to either the entire species or to an adult member of the species (or only to the female of the species, though that's not relevant to this puzzle).
CGPT seems to be only considering the definition of "cow" as an adult member of the species when it says "cows don't drink milk", while technically, "cows" (as in some members of the species) can drink milk. So the answer "cows drink milk" can be correct if interpreted in that sense of the word "cow". When used in that sense, the answer "cows don't drink milk" is wrong.
It would be interesting to ask CGPT: "in what sense is the sentence 'cows drink milk' correct?"
Does it have the ability to notice and act on flexible definitions of words, or will it dogmatically cling to the one single definition it found who knows where?
Tried it, on both the same convo and a fresh one. Even when prompted to acknowledge the colloquial definition that "cow"="any bovine", it provides technicalities like "sometimes excess milk is given as a treat or surplus feedstock to farm animals, which includes cows", but I can't get it to acknowledge "colloquially, calves are cows that drink milk" - it only agrees when directly asked in the fresher conversation, and actually tells me that's wrong in the longer one (with more tokens fed back in about the "correct" definition).
I shelled out for plus and tested this on GPT4. Got it in one:
>Me: in what sense is the sentence 'cows drink milk' correct?
>CGPT4: The sentence "cows drink milk" can be considered correct in the context of baby cows, also known as calves. Calves drink milk from their mothers as their primary source of nutrition during the early stages of their lives. However, once they grow into adult cows, they no longer consume milk and instead graze on grass, hay, and other types of forage. So, the sentence can be interpreted as correct when specifically referring to calves.
compared to 3.5, which kept insisting it was a metaphor over 3 more tries:
>CGPT3.5: The sentence "cows drink milk" is actually incorrect. Cows are mammals that produce milk, but they do not drink milk themselves. Instead, they drink water just like most other animals. Therefore, this sentence can be considered incorrect in the literal sense.
>However, it is possible that the sentence is being used in a figurative or metaphorical sense. For example, if someone were to say "cows drink milk" as a way of describing someone who is acting foolishly or inappropriately, it could be interpreted as a metaphorical statement. In this case, the sentence would not be taken literally and would instead have a different meaning altogether.
I think we are introducing too much context into a question that boils down to "is it capable of reasoning?"
To answer that question, one needs to strip away unnecessary context.
GPT-4 can’t sum two unusually big integers.
This is as context-free as it gets. Passing this test wouldn't be conclusive, but failing it seems conclusive: it isn't capable of reasoning.
With this fact in mind, explaining why it can't solve a logical riddle is pointless.
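If you want to run that test systematically rather than eyeball a couple of examples, something like the following works. `ask_model` is a placeholder for however you query the model (an API call, a chat window, whatever), not a real library function:

    # Probe: can the model add two unusually big integers? Python's integers are
    # arbitrary precision, so the reference answer is exact.
    import random

    def ask_model(prompt: str) -> str:
        # Placeholder - wire this up to whichever model you are testing.
        raise NotImplementedError

    def probe(digits=40, trials=20):
        correct = 0
        for _ in range(trials):
            a = random.randrange(10 ** (digits - 1), 10 ** digits)
            b = random.randrange(10 ** (digits - 1), 10 ** digits)
            reply = ask_model(f"What is {a} + {b}? Answer with only the number.")
            correct += reply.strip().replace(",", "") == str(a + b)
        return correct / trials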
GPT-4 does not know that when you are on the boat, the items left on the land side are alone together.
I remember being asked this question as a 7-year-old, and when the question was told to me, the same information was omitted.
Edit: just realized you flipped the scenario. Yes it seems like a case of pattern matching to a known problem. I think if you changed the variables to A, B, and C and gave a much longer description and more accurate conditions, it would have a different response.
I had worried your word "safely" biased it to using conventional safety rules that goats can't be left with cabbage, but even omitting that, it fails.
FWIW, full word substitution passes somewhat in GPT-4 (unlike GPT3), even if I heavily alter the vehicle and destinations. The problem here is that the probabilities of this being the standard puzzle are so high that your altered language isn't breaking the prior.
I don't know much about language models, but don't they just have knowledge of patterns between words, with no real reasoning capability at all?
He didn’t misstate the puzzle, the whole point is to give an alternative version of the puzzle, and GPT 4 doesn’t notice that alternative. It’s exactly as difficult as the standard version as long as you are doing the logic instead of pattern-matching the puzzle form to text.
Ah, I had missed that interpretation. Although, that may explain why GPT-4 got it wrong: there's so much context in its training data about the relationship between lions and humans, and this puzzle specifically, that like this human its response was swayed...
But I think that's the whole point of the exercise? That GPT-4 is leaning on stringing tokens together in a reply rather than reasoning through the problem itself which, I would think, would be "required" for AGI (though we may end up finding out that well trained language models in specific domains eliminate the need for generalized cognition).
In any case, it's an interesting exercise regardless of your opinion/stance on the matter!
But the human (in the comment chain) here made exactly the same mistake!
In that sense this test doesn't seem to be a good fit for testing reasoning capabilities, since it's also easy for humans to get wrong (and humans also don't always reason about everything from first principles, especially if they have similar answers already cached in their memory).
It seems you would need novel puzzles that aren’t really common (even if in kind) and don’t really sound similar to existing puzzles to get a handle on its reasoning capabilities.
The human recognized that they made the mistake and fixed it. As mentioned in the original comment, GPT failed to recognize the mistake even after being told. That's the key here that indicates it can't "reason."
There are open questions about whether or not it really needs to reason given sufficient training, but that seems to be the gap here between the human and the machine.
Rewrite the problem in a way that doesn't bias it toward common priors and it reasons it out just fine:
"Suppose I have an object of type C, an object of type G and an object of type L, and I need to get them across a river. I have a boat that can only carry myself and a single other object. I am not allowed to leave the objects of type C and L alone together, and I am not allowed to leave the objects of type L and G alone together. How can I safely get all three across?"
"This is a classic puzzle known as the "river crossing problem". Here's how you can safely get all three objects across the river:
First, take the object of type L across the river. Return alone to the original side of the river. Take the object of type C across the river. Return with the object of type L to the original side of the river. Take the object of type G across the river. Return alone to the original side of the river. Finally, take the object of type L across the river again. By following these steps, you can safely transport all three objects across the river without violating the restrictions given in the problem. Note that you have to start with the object of type L because it cannot be left alone with either of the other two objects. Once L is transported to the other side, you can use it to ensure that C and G are never left alone together."
Or with Bing, you don't even need to tell it what it assumed wrong - I just told it that it's not quite the same as the classic puzzle, and it responded by correctly identifying the difference and asking me if that's what I meant, but it forgot that the lion still eats the goat. When I pointed that out, it solved the puzzle correctly.
Again, this is not about being able to write the prompt in a way that allows GPT to find the answer. I’m not doubting its ability to do so. It’s that a human can reason through why the answer should be different, despite any common priors, and arrive at the correct judgment.
It indicates that there’s still something a human does that the machine doesn’t, even if we’re not able to place what it is. This is neither an argument for nor against progress towards AGI, just an observation. It’s interesting regardless (to me).
It can do that, though..? That's kind of the point of the Bing example. I told it it was making a wrong assumption in its original answer (I didn't tell it what was wrong) and it figured it out.
this here is why it's not fair to criticize GPT-4 so quickly on this question.
for the record, I made the same mistake as nonfamous at first; I almost commented "but it's correct" before going back to double-check what I was missing.
I simply skimmed the problem, recognized it as a common word problem, and totally missed the unusual constraints in the question. I just didn't pay attention to the whole question.
Which to be fair is what most people reading that problem understood the first time. I wonder what would happen if you then tell gpt "No, it's the lion that can't be with the cabbage, not the goat, try again"
It's even better. You can tell it that it's not quite the classic puzzle, and then it will actually figure out the differences and summarize them. From there it can solve it.