
If a model can generate it, it can understand it.

They could probably build on this to make a multi-modal GPT that is fed video and understands what is going on. That's how you get "smart" robots: active scene understanding via the video modality plus conversational capabilities via the text/audio modality.
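A minimal sketch of what that combination might look like, assuming per-frame features from some frozen image encoder and a toy transformer standing in for a pretrained LLM. `VideoLanguageModel` and every dimension here are hypothetical, not any existing system's API:

```python
import torch
import torch.nn as nn

class VideoLanguageModel(nn.Module):
    """Toy video-conditioned LM: video features become a prefix to the text."""
    def __init__(self, frame_dim=768, model_dim=512, vocab_size=32000):
        super().__init__()
        # Project per-frame features into the language model's embedding space.
        self.frame_proj = nn.Linear(frame_dim, model_dim)
        self.text_embed = nn.Embedding(vocab_size, model_dim)
        # Stand-in backbone; a real system would reuse a pretrained LLM here.
        layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=8,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(model_dim, vocab_size)

    def forward(self, frame_feats, text_tokens):
        # frame_feats: (batch, n_frames, frame_dim); text_tokens: (batch, n_text)
        vid = self.frame_proj(frame_feats)        # video "tokens"
        txt = self.text_embed(text_tokens)
        seq = torch.cat([vid, txt], dim=1)        # video prefix + text
        h = self.backbone(seq)                    # no causal mask: sketch only
        return self.lm_head(h[:, vid.size(1):])  # logits over text positions

# Usage: 16 frames of precomputed features, an 8-token prompt.
model = VideoLanguageModel()
frames = torch.randn(1, 16, 768)
tokens = torch.randint(0, 32000, (1, 8))
logits = model(frames, tokens)                    # shape (1, 8, 32000)
```

The "video as a token prefix" framing is one common way to bolt a visual modality onto a language model; the design question is mostly where the frame features come from and how the two pretrained parts are aligned.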



But we can already do this?



