They can probably reverse-engineer this to build a multimodal GPT that is fed video and understands what is going on. That's how you get "smart" robots: active scene understanding via the video modality, plus conversational capabilities via the text/audio modality.
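Roughly, the idea is a single transformer attending over both modalities at once: video frames get encoded into embedding "tokens" and concatenated with the text token sequence. Here's a minimal PyTorch sketch of that shape of architecture; every name and dimension here is made up for illustration, and it's nothing like a production system (which would use a pretrained vision backbone, patch tokenization, and causal masking):

```python
import torch
import torch.nn as nn

class VideoConditionedLM(nn.Module):
    """Hypothetical sketch: each video frame becomes one embedding
    token, prepended to the text tokens, so a single transformer
    attends jointly over scene and conversation."""

    def __init__(self, vocab_size=32000, d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        # Toy frame encoder: pool each frame to 16x16, then project.
        # A real system would use a pretrained vision encoder (assumption).
        self.frame_encoder = nn.Sequential(
            nn.AdaptiveAvgPool2d((16, 16)),
            nn.Flatten(start_dim=1),                 # (B*T, C*16*16)
            nn.Linear(3 * 16 * 16, d_model),
        )
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, frames, text_ids):
        # frames: (B, T, C, H, W); text_ids: (B, L)
        b, t, c, h, w = frames.shape
        frame_emb = self.frame_encoder(frames.reshape(b * t, c, h, w))
        frame_emb = frame_emb.reshape(b, t, -1)      # one token per frame
        text_emb = self.token_embedding(text_ids)
        seq = torch.cat([frame_emb, text_emb], dim=1)  # video tokens first
        hidden = self.transformer(seq)
        # Predict text logits from the positions holding text tokens.
        return self.lm_head(hidden[:, t:, :])

model = VideoConditionedLM()
frames = torch.randn(1, 8, 3, 64, 64)    # 8 frames of fake video
text = torch.randint(0, 32000, (1, 16))  # 16 fake text tokens
logits = model(frames, text)
print(logits.shape)                      # torch.Size([1, 16, 32000])
```

The point of the sketch is just that "video modality" and "text modality" end up as the same thing, a sequence of embeddings, so one model can do both scene understanding and conversation.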