I've seen threads involving Gary Marcus on Twitter, and when people provide concrete evidence that his point of view is wrong, he just stops replying, then goes on repeating the disproven claim in other places.
Just the other day, when he claimed that GPT was literally just doing memorization/word statistics/syntax and has no grasp of semantics, some folks demonstrated that GPT can literally act as an interpreter for arbitrary code.
(There are some tricks involved here: you have to get it to interpret the program "with pen and paper" by having it record all the state updates / variable mutations that happen in the code, which can be done by inserting copious print statements.)
He then claimed that it was only able to interpret this program because it must have seen it before in its vast training data. He accused a commenter of not understanding how big the training data was.
It was then shown that GPT can interpret a Python program that is operating on two randomly chosen large integers, a combination that is certainly not in its training data. This shows that it must be "understanding" (for lack of a better word) the semantics of the program. Gary then stopped responding.
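Roughly, the test looked like the following sketch (made up here for illustration, not the actual program from that thread): two freshly generated large integers get substituted into a short Python snippet whose print statements expose every state update, and you then paste the concrete program into the prompt and ask GPT to trace it "by hand".

```python
# A made-up sketch of the kind of test described above, not the actual
# program from that thread. The two integers are freshly generated, so the
# exact pair almost certainly isn't in any training corpus; the prints force
# the model to track each state update / variable mutation as it "executes".
import random

a = random.randint(10**9, 10**10)
b = random.randint(10**9, 10**10)

total = 0
for i in range(4):
    digit = a % 10                      # peel off the lowest digit of a
    total += digit * (b % 100)          # mix it with the low digits of b
    a //= 10
    print(f"step {i}: digit={digit}, a={a}, total={total}")

print("result:", total)
```

If the model reproduces the printed trace correctly for numbers it has never seen, "it memorized this from the training data" stops being a plausible explanation.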
I don't think GPT on its own will lead anywhere close to AGI -- and I don't think anyone serious thinks this. But GPT combined with a sophisticated pipeline of wrapper scripts to feed its own output back into itself in clever ways and give it access to external data sources and tools? Possibly!
> I don't think GPT on its own will lead anywhere close to AGI
Serious question: why isn't ChatGPT already considered AGI? It looks general to me in its domain (text). It can understand and execute instructions, even when provided in incomplete or ambiguous form; it can reason step by step; it can compose simple poems; it can attempt to explain jokes; it can explain and modify code; and it reacts appropriately to almost any request. In all this it shows a perfectly natural "understanding" of language and context, integrated with a good amount of common sense and knowledge about the world. Of course it's not perfect, and its limits show pretty quickly in almost all these fields. But still, it's general within its own domain. Aren't we just moving the goalposts?
One of your examples is the key: “attempt” explaining jokes. Really, it’s “attempting” to do all of the items you list, and can seemingly do only very simple ones. Try and get it to produce a poem that doesn’t rhyme. Try to get it to be internally consistent when explaining something abstract. Play the two truths and a lie game with it. It doesn’t understand any of the items you listed because it has no concepts, just math.
> Try and get it to produce a poem that doesn’t rhyme
Good one, I hadn't tried that. It seems that the mention of a "poem" puts it in a special frame that it finds very hard to escape (I managed to get it to avoid rhyming in the first lines, but then it reverts to rhymes).
However,
> it’s “attempting” to do all of the items you list
Yes, that's ok. I don't think we should confuse AGI with human-level or even superintelligence: it's perfectly ok for ChatGPT to try and not quite make it on this or that task. General intelligence doesn't mean being able to perform every task that can be performed by at least one human being; by that metric we would all fail. Nor does it mean never behaving in an obviously obtuse way in any situation; we would also fail that test given enough time. ChatGPT displays understanding of context, common sense, metaphors, inference, intuition, abductive reasoning. It actually answers correctly tricky questions or scenarios that have been brought up in the past as examples of what an AGI should be capable of. It's clear that it's not yet good enough to perform even a simple intellectual job reliably, but that's not really what AGI is. Have you ever interacted with people with Down's syndrome? Would you ever say that they lack general intelligence?
The main point here is that Large Language Models are crushing through barriers that people like Gary Marcus have previously deemed impossible to surpass.
Check past episodes with these guys on the Machine Learning Street Talk podcast (the whole "symbolic AI" crowd) for some gold quotes that just seem silly now!
I mean, we are not that far from 2017, when LLMs couldn't really write coherent text. And, for example, we had people saying then that they would never be really coherent because text is too sparse (see Chomsky, etc.). Instead, all they would ever do is copy and paste training examples. Naturally, these people didn't really understand how transformers form interdependencies and how they compose sequences in a much more complex fashion, as is by now obvious. However, at the time, and with their crude understanding of these models, these guys sure were confident they were right.
Next it was question answering, reasoning, math problems, deduction, coding... each of them things that LLMs could never do! They were so sure of it. And sure enough, a new model comes around that does them pretty well.
Researchers are bitter for two reasons.
First, we do not say that LLMs are perfect or general AI. Nobody says this. It's an obvious point not worth arguing over.
But Gary Marcus et al. have become extremely popular. It's probably because some readers are slightly concerned about AI, and want someone to tell them that AI isn't really AI yet (duh) and won't replace their precious job (yet?). Their point is a strawman, and what goes beyond it (AI can't ever... X) is mostly wrong.
Second, however, these people have never contributed to the actual models. They are not part of the progress that - notably - is crushing through these barriers. They are bystanders.
And it's freaking annoying, because their understanding of LLMs is imperfect (as is the case for everyone!), and yet they come up with these statements of absolute certainty.
There is indeed research into these questions. But it is far, very far from resolved. In that sense, these people are - sorry to say - charlatans.
Well, at least everyone I know rolls their eyes when the next Gary Marcus article or tweet comes around.
There's a deluge of papers and research right now, so there's quite a bit of complexity to saying "pretty well".
However, let me say it this way:
Compared to other LLMs, recent OpenAI models score highly on logic and math exercises. Yes, there ARE better LLMs trained to do math computations (especially ones fine-tuned for certain problems), but I'd say ChatGPT is certainly impressive as a general text and code model.
The other side of the coin of saying "pretty well" is this:
There is no other type of computational approach that is able to solve free-form logic or math queries in any capacity. There is no symbolic approach that can "extract" a math problem from text and then solve it, in code or otherwise, whereas LLMs are getting close to human performance on such tasks (and related ones).
As I understand that paper, I would argue that the LLM isn't "doing" math at all. Look at Figure 4 for the process to be most clear. It's generating a text program that, when run via Python, can solve the problem. All the LLM is doing is matching the equation in the question to whatever operators it needs in Python syntax. I wouldn't call that doing or understanding math.
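To make that concrete, here is the sort of thing I understand the model to be emitting, which is then executed to get the answer; the word problem and the code are made up for illustration, not taken from the paper:

```python
# Illustrative only: a made-up word problem and the kind of short program
# the model is described as generating; running it produces the answer.
#
# Problem: "A tank holds 2400 liters and drains at 15 liters per minute
# while being refilled at 9 liters per minute. How long until it is empty?"

tank_volume = 2400                          # liters
drain_rate = 15                             # liters per minute flowing out
refill_rate = 9                             # liters per minute flowing in

net_rate = drain_rate - refill_rate         # net outflow per minute
minutes_to_empty = tank_volume / net_rate   # 2400 / 6

print(minutes_to_empty)                     # 400.0
```

The translation step is exactly my point: the quantities and operators from the text get mapped into Python, and Python does the arithmetic.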
So yes, you'll probably want to take a model trained on math problems to forgo the step with code.
But then, note that writing correct programs is indeed a high-level solution to a full-text math problem, is it not? Going from there to solving it directly should be a matter of some tuning.
Finally, who said anything about understanding math?
The whole debate is about getting shockingly useful results precisely without symbolic reasoning.
If the main contention is that GPT does not do symbolic reasoning, then we are back at Gary Marcus… yes, we know this. It's not why researchers are so amazed by these models. It's that they output steps (or, in this case, code, since it's Codex-based) solving university-level math with a simple transformer architecture.
It can't do math because it is operating in a single neural network path. You, also, cannot do math in a single neural network path. Even when you add two small numbers like 123+456 your brain is mentally iterating over the digits, detecting if a carry is needed, doing that, etc. That is, you have a looping/recursive process running inside your brain. You only output the final answer.
GPT does not have such a looping/recursive process inside its neural net. It's a fixed depth non-recursive neural net.
You can get it to emulate recursive processes by prompting it with tricks like "think step by step". If you describe the addition algorithm you learn in elementary school (i.e. digit by digit, carry if the sum exceeds 9, etc.) in sufficient detail, it can execute that algorithm.
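To spell out the kind of looping process I mean, here is the elementary-school addition procedure written as an explicit loop. It's only a sketch to make the point concrete (and the sort of procedure you'd have to describe to GPT step by step), not a claim about how GPT represents anything internally:

```python
# Grade-school addition: rightmost digit first, carry when a column's sum
# exceeds 9. The point is the explicit loop and the carried state, which a
# fixed-depth forward pass can only emulate via the text it generates.

def add_by_hand(x: str, y: str) -> str:
    width = max(len(x), len(y))
    x, y = x.zfill(width), y.zfill(width)         # pad to equal length
    carry = 0
    digits = []
    for dx, dy in zip(reversed(x), reversed(y)):  # iterate right to left
        s = int(dx) + int(dy) + carry
        digits.append(str(s % 10))                # digit to write down
        carry = s // 10                           # carry into the next column
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))

print(add_by_hand("123", "456"))   # 579 (no carries needed)
print(add_by_hand("987", "45"))    # 1032 (carries propagate left)
```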
GPT-3 is a graph neural network with added recursive information through positional encoding. It outputs sequentially, but I am not sure why an RNN would be required beyond that.
I would agree that the manner of reasoning must differ, since the sequence follows BPE tokens rather than logical steps; however, who is to say that another form of mathematical reasoning could not lead to valid results?
For instance, GPT might solve the problem at each output step insofar as required to generate that token.
It certainly iterates over each output token, and the encoding of the problem is equivalent to iterating over the characters of the math problem, roughly speaking. But yes, the iterated output does not follow a logical graph externally; it is token by token. Internally, however, the network can absolutely follow a more complex graph.
Could you say what you mean by single network path when we speak about attention based architectures?
What sort of operation or information is missing in such architecture?
I am aware of some results relating to certain graphs, but I do not think this would apply to a text describing a math problem, say.
ChatGPT is trained via reinforcement learning to give answers that - ultimately - human non-experts would judge to be plausible. Answers were not trained to be correct or accurate - hence we should not expect ChatGPT to do well there.
Sociologists have found that humans seek plausible answers more than correct answers, especially when trying to make sense of a situation. For that reason, the training objective makes sense for a chatbot.
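As a toy illustration of that objective (emphatically not OpenAI's actual pipeline), imagine a hypothetical reward_model standing in for the human raters; a best-of-n selection loop then favors whatever scores as most plausible, and nothing anywhere in the loop checks whether the winning answer is true:

```python
# Toy sketch of "optimize for what raters judge plausible", not OpenAI's
# actual RLHF pipeline. reward_model is a hypothetical stand-in for human
# judgments; this dummy version just rewards longer, fluent-sounding answers.
# Note that no step below consults any source of truth.

def reward_model(prompt: str, response: str) -> float:
    # Hypothetical stand-in: scores surface plausibility, not accuracy.
    return float(len(response.split()))

def best_of_n(prompt: str, candidates: list[str]) -> str:
    return max(candidates, key=lambda r: reward_model(prompt, r))

print(best_of_n(
    "Why is the sky blue?",
    ["Because of Rayleigh scattering of sunlight.",
     "It just is.",
     "The sky reflects the ocean, which is blue and very large."]))
# The confident-sounding but false third answer wins under this dummy reward.
```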
Now, anyone who tries to sell ChatGPT as giving truthful answers is of course doing a disservice to the model and its training objective.
However, anyone also trying to sell the above either as
a) Deep insight, or
b) inherent limitation of LLMs
is indeed a pot calling the kettle black.
Going from plausibility to truthfulness is probably not a giant leap - at least not compared to what was achieved in the past three years. There's active research on it, and I am sure a good solution will arise.
However, ChatGPT is not that, and it isn't meant to be. It's not the training objective.
Making this a huge point is again the same BS strawman as the other things I mentioned before.
> ChatGPT is trained via reinforcement learning to give answers that - ultimately - human non-experts would judge to be plausible.
Ostensibly, it is being trained to give replies that are more likely than most to be given by humans. On the reasonable (at least IMHO) assumption that people's speech is biased towards saying things that their target audience might see as plausible, this can be seen as doing what you say in the above quote.
This is not, however, the same as being able to judge the plausibility of the semantic content of those statements, so I remain skeptical that the current methods for training LLMs are capable of creating that ability, which in turn leads me to be skeptical that going from plausible statements to reasonably reliably true statements will be a relatively small step in comparison*. I'm open to being surprised, however, as I already have been by what has been achieved so far.
The LLM-does-math paper you mention in your sibling comment looks extremely interesting, and maybe I will change my mind...
* At least by a continuation or scaling-up of current methods. Humans may be more likely to say truthful things than falsehoods (at least in some areas and some contexts) but a gap has formed between plausible and truthful LLM productions, and I don't see any particular reason to think more of the same will close it.
What do you mean by current methods though?
ChatGPT is trained very differently, method-wise, from GPT-3 itself.
Whereas GPT-3 was strongly based on sequential likelihood over a large corpus, ChatGPT is trained (I would claim) to produce outputs that are judged as good answers and hence plausible. Which is also why the recent iteration of GPT-3 is still better than ChatGPT on some tasks.
Of course, a truthful model needs a new approach, and perhaps a new baseline model.
For instance, starting with code rather than text has been beneficial for recent GPT models, presumably by learning stronger reasoning compositionality.
Perhaps some such baseline of truthfulness needs to be another starting point before large-corpus language modeling.
Nevertheless, it seems to me this paradigm is more likely to be the way forward than any symbolic approach, say, which is not yet able to even produce text afaik.
I mean, even if we need a retrieval-style fact-checking component, I do now believe there'll be a transformer-equivalent LLM in any future model.
Unless the difference in training of GPT-3 and ChatGPT has led to an improvement in the veracity of the replies given, I think the differences in their training are moot in regard to the question of what it will take to get substantially true responses from them. For the reasons I gave in my previous post, I don't regard progress in appearing plausible as necessarily progress towards being able to tell truth from falsehoods.
To be clear, I am not claiming that a symbolic approach would be better. Pointing out the shortcomings of current methods does not provide evidence that symbolic approaches will succeed (or vice-versa, for that matter.)
This leaves code. On reading your statement about starting with it, it suddenly occurred to me that code is a constrained environment compared to the whole of human discourse. It is obvious, I think, that programming languages are extremely limited in what they can express in comparison to human languages, but perhaps less obviously (at least to me), it seems to follow that what humans can say about programs is also constrained (again, in comparison to human discourse in general - but not nearly as constrained as what can be said in the programming languages themselves.)
This leads me to agree with you in this respect (at least tentatively), and here's why: Currently, on Earth, we have one species whose individuals have a well-developed sense of themselves as agents in that world, a similarly well-developed theory of mind about others, and are adroit language users. A few other species have some of these abilities to a limited extent, but not nearly so well developed that they can fully make use of them. As an evolutionist, I suppose that some of our ancestor and related species had intermediate levels of these capabilities, and if we can make AIs that occupy that space we would be making progress.
In this view, attempting to match or surpass human performance in language use and understanding the world is about as difficult a target as we could have. As an intermediate target, the domain of program code and the things people say about it has some things going for it: not only is it more constrained (as argued above), but quite a lot has been written about it and is available for training. It is also (arguably) less likely than human language in general to contain statements intended to influence opinions without regard to the truth.
Yeah. I don't trust Gary Marcus, and I don't know why the media buys into his persona.
Gary Marcus features a Forbes story in his Twitter bio, "7 Must-Read Books About Artificial Intelligence". That's an article which Gary Marcus paid for (that's what "Forbes Contributor" means; they're cheap, too!). This makes alarm bells go off.
Marcus was one of the founders of "Geometric Intelligence", which was acquired by Uber. Three months later, Marcus left Uber, claiming he remained a "special advisor"[0] to Uber, whereas Recode said he was no longer employed there at all[1]. By my reading, it's possible Geometric Intelligence was just a patent troll, and was acquired simply for its patents[2][3].
Select extracts from that Wired piece:
> The company has filed for at least one patent, Marcus says. But it hasn't published research or offered a product
> But Marcus paints deep neural nets as an extremely limited technology, because the vast swaths of data needed to train them aren't always available. Geometric Intelligence, he says, is building technology that can train machines with far smaller amounts of data.
[uh oh; my BS detector just went off.]
I heard Marcus published papers on AI; does anyone know if they're any good?
Is this guy just a successful self-promoter? Why is he being paraded by the media as the AI expert? Why does he sound so shady? (Especially with that Forbes link, yikes; sorry, but I can't take anyone seriously who pays for fake positive news stories.)
(I should also add: when the media has "go-to" experts, they're not primarily selected for their expertise, per se, but for how "available" and eager they are to respond to all interview requests; I've seen the other side of that curtain.)
Gary Marcus's takes aren't credible, for all the reasons you cited (Forbes contributor... roll eyes) and this, from the interview with Ezra:
>Take GPT-3. ChatGPT itself probably won’t let you do this. And you say to it, make up some misinformation about Covid and vaccines. And it will write a whole story for you, including sentences like, “A study in JAMA” — that’s one of the leading medical journals — “found that only 2% of people who took the vaccines were helped by it.” You have a news story that looks like... it was written by a human being. It’ll have all the style and form, making up its sources and data. And humans might catch one, but what if there are 10 or 100 or 1,000 or 10,000 of these? Then it becomes very difficult to monitor them.
That's absurd. First, because if there's a JAMA study (love how he explains that it is the Journal of the American Medical Association) that is used to support even mediocre science journalism, e.g. an opinion piece in the Wall Street Journal or Newsweek, then it has an inline link to the JAMA study. Both CNN and Fox do similarly! After getting burned enough times, their reporters even learned to distinguish between medRxiv and peer-reviewed articles. GPT-3 doesn't make fake URLs with fake associated JAMA articles.
Computational journalism has been around for a long time. Gary Marcus underestimates and lacks understanding of humans AND GPT3! If GPT3 spews 10,000 fake, unsourced COVID vaccine efficacy news articles, that isn't an existential risk to humanity.
GPT3 is impressive but it can't do everything (yet?), e.g. it has trouble taking derivatives of functions. It will give reasonable-sounding answers to StackOverflow questions but the substance will be incorrect. Yes, it's annoying, but the StackOverflow OP will realize and look for help elsewhere. Gary's take:
>Now everybody in the programming field uses Stack Overflow all the time. It’s like a cherished resource for everybody. It’s a place to swap information. And so many people put fake answers on this thing where it’s humans ask questions, humans give answers, that Stack Overflow had to ban people putting computer-generated answers there. It was literally existential for that website. If enough people put answers that seemed plausible but were not actually true, no one would go to the website anymore.
Best of all:
>And imagine that on a much bigger scale, the scale where you can’t trust anything on Twitter or anything on Facebook... because you don’t know which parts are true and which parts are not.
Anyone who blindly trusts what they read on Twitter or Facebook has bigger problems than ChatGPT. How naive does Gary think people are?!
EDIT: I am more concerned by coding contests where GPT-3 DOES run circles around human contestants.
> then it has an inline link to the JAMA study. Both CNN and Fox do similarly!
I thought he was pretty clearly saying the JAMA study in his example would be made up by ChatGPT. I've seen it make up lots of fake citations, so maybe I just had that in the back of my mind and assumed wrong?
> Gary Marcus doesn't realize that spewing 10,000 erroneous COVID vaccine efficacy news articles isn't among them.
He caveated in the interview, about this example or a similar one, that the safety filter may stop it.
> Anyone who blindly trusts what they read on Twitter or Facebook has bigger problems than ChatGPT. How naive does Gary think people are?!
He's clearly saying it is a matter of degree when he details the submachine gun analogy, and a matter of cost when he compares the actual costs of a human troll farm.
I think he was wrong about several things, mostly in dismissing its generalization ability and its ability to come up with abstract metaphors related to the text, etc., but I don't think your comment is being very fair to him.