While this is a very cool project that shows a great use of machine learning to answer questions about images in a roughly explainable way, I think people are extrapolating quite a bit as though this is some kind of movement forward from GPT-4 or Midjourney 5 into a new advanced reasoning phase, rather than a neat new combination of stuff that existed a year ago.
Firstly, a bunch of the tech here is recognition-based rather than generative; it is relying heavily on object recognition which is not new.
Secondly, the two primary spaces where generative tech is used are
1. For code generation from simple queries over a well-defined (and semantically narrow) spatial API — this is one of the tasks where generative AI should shine in most cases. And
2. As a punt for something the API doesn't allow: e.g. "tell me about this building", which then comes with the same inscrutability as before.
The number of examples for which the code is essentially "create a vector of objects, sort them on the x, y, z, or t axis, and pick an index" is quite high. But there aren't really any examples of determining causality or complex relationships that would require common sense. It is basically a more advanced SHRDLU. That's not to say this isn't a very cool result (with an equally cool presentation). And I could see this tech being used to apply basic visual rules ad hoc to generative AI (for example, Midjourney 6 could just regenerate images until "do all hands in this image have five fingers?" is true).
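To make the pattern concrete, here is a minimal sketch of the shape of program such a system tends to generate. Everything here is hypothetical: `DetectedObject`, `find_objects`, and the canned detections are stand-ins, not the actual ViperGPT API; in a real system the detector call would hit a pretrained recognition model.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DetectedObject:
    name: str
    x: float  # horizontal centre of the bounding box, in pixels

def find_objects(image, name: str) -> List[DetectedObject]:
    """Stub for the recognition layer. A real system would call a
    pretrained open-vocabulary object detector here; canned values
    keep this sketch runnable."""
    return [DetectedObject("mug", 310.0),
            DetectedObject("mug", 120.0),
            DetectedObject("mug", 205.0)]

# The kind of code an LLM emits for a query such as
# "what is the second mug from the left?":
def second_mug_from_left(image) -> DetectedObject:
    mugs = find_objects(image, "mug")  # create a vector of objects
    mugs.sort(key=lambda o: o.x)       # sort on the x axis
    return mugs[1]                     # pick an index
```

The generative model's job reduces to composing a handful of such detect/sort/index steps, which is why the spatial API makes the task tractable without any real commonsense reasoning.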
> I think people are extrapolating quite a bit as though this is some kind of movement forward from GPT-4 or Midjourney 5 into a new advanced reasoning phase, rather than a neat new combination of stuff that existed a year ago.
It can be both. Life itself was a "neat combination of stuff that existed" before. It isn't about the raw ingredients, but the capability of their whole.
Also, history has shown that there are periods of time when rapid progress happens. It looks like we are in one of those, and it will make the previous ones look like baby steps.
I interpret your comment as "although ViperGPT is innovative, it is not as radical as GPT-4 or Midjourney 5". Here, "radical innovation" is a term from the innovation literature. (https://bigthink.com/plus/radical-vs-disruptive-innovation-w...)
Although I largely agree with you, I still think this is a massive development as it will likely change the way empiricists use computer vision.