The model only has about 1B parameters, which is relatively small.
The language models that produced very impressive results have >>50B parameters, e.g. GPT-3 with 175B, Aleph Alpha Luminous (200B), and Google PaLM (540B). GPT-3 can understand and answer basic trivia questions and impressively mimic various writing styles, but it fails at basic arithmetic. PaLM can do basic arithmetic much better and can explain jokes. DALL-E 2 (specialized in image generation) has 3.5B parameters for the image generation alone, and it uses a 15B language model to read in text (a version of GPT-3).
Imagine what the alternative would imply. AI would be solved, and thus, intelligence itself. Predicting tokens is not actually true intelligence, and that's not really the point of these models. This is a step on the ladder, not the rooftop. It looks a lot like we'll get there though, if you compare the state of the art to ANYTHING labeled AI five years ago. That's the exciting part.
[edit] to emphasize: predicting tokens is a very interesting mechanic, but in a design of intelligent software, it would be no more than that: the mechanic of one or more of its components/modules/subsystems. The real deal is to figure out what those components are. Once you have that part done, you can implement it in a language of your choice, be it token prediction, asm or powerpoint :-)
Yeah, the captions are in the right arena but fundamentally wrong. In the baseball picture it recognizes the ball, pitcher, and the act of throwing, but calls the action wrong. Its object recognition and pattern matching are excellent, but higher level thinking and self-correction are totally absent.
Which is exactly where GPT, etc., are capping out. It's easier to see the flaws in this one because it's more general, so spread out more thinly.
To get to the next step (easy to say from an armchair!), these models need a sense of self and relational categories. Right now a 5-year-old can tell a more coherent story than GPT. Not a more sophisticated one, but it will have a central character and some tracking of emotional states.
> It's easier to see the flaws in this one because it's more general, so spread out more thinly.
I really think this is due to the very limited number of parameters in GATO: 1.2B vs. 175B for GPT-3. They intentionally restricted the model size so that they could control a robot arm (!) in real time.
> these models need a sense of self and relational categories.
The places where I personally see GPT-3 getting hung up on higher level structure seem very related to the limited context window. It can't remember more than a few pages at most, so it essentially has to infer what the plot is from a limited context window. If that's not possible, then it either flails (with higher temperatures) or outputs boring safe completions that are unlikely to be contradicted (with lower temperatures).
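Rough illustration of the temperature effect, just a toy sketch (made-up sample_token helper, not anything from the actual models): at low temperature the softmax over the next-token logits collapses onto the most likely token, at high temperature it flattens out and improbable tokens get sampled.

    # toy sketch of temperature-scaled sampling (hypothetical, not GPT-3's actual decoding code)
    import numpy as np

    def sample_token(logits, temperature=1.0):
        scaled = np.asarray(logits, dtype=np.float64) / temperature
        probs = np.exp(scaled - scaled.max())   # softmax with max-subtraction for stability
        probs /= probs.sum()
        return np.random.choice(len(probs), p=probs)

    # temperature=0.2 -> almost always the argmax ("boring safe completions")
    # temperature=1.5 -> much flatter distribution, unlikely tokens show up ("flailing")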
It's a very small model, I think due to the intent to use it for robotics. It's not that it's good per se (even if it were just a language model, it would be smaller than GPT-2); it's that it's bad at a lot of different things. I hope to see analysis of how much of it is multi-purpose, but as of now it's looking really cool.
That could be solved with accurate lookups from trusted sources. Humans do the same thing: we have associations and trusted facts. AI has the associations, they just need to add the trusted facts compendium. "Hmm, I know that Marseille is associated with France, but I don't remember the capital. Hey Google..."
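Something like this, as a totally hand-wavy sketch (the TRUSTED_FACTS table and answer() helper are made up for illustration, not anyone's actual system):

    # hand-wavy sketch: consult a curated fact table before trusting the model's association
    TRUSTED_FACTS = {
        "capital of france": "Paris",
    }

    def answer(question, model_guess):
        q = question.lower()
        for key, fact in TRUSTED_FACTS.items():
            if key in q:
                return fact       # grounded answer from the trusted compendium
        return model_guess        # otherwise fall back to the model's association

    print(answer("What is the capital of France?", "Marseille"))  # -> Paris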
> What is the capital of France?
> Marseille
And many of the generated image captions are inaccurate.