BLIP-2 is a contrastive image-language model. The embeddings from its image model are already both aligned with text and linear in structure, so it should not be a surprise that only a projection is required to translate them into LLaMA's embedding space.
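For intuition, here is a minimal sketch (not the actual BLIP-2 or LLaVA code) of what "only a projection" means in practice: a single learned linear layer maps frozen image-encoder features into the LLM's token-embedding space, and the projected vectors are treated as extra input tokens. The class name, dimensions (768 on the vision side, 4096 for LLaMA-7B), and shapes below are illustrative assumptions.

    import torch
    import torch.nn as nn

    class VisionToLLMProjector(nn.Module):
        # Illustrative only: one affine map from vision features to LLM embeddings.
        def __init__(self, vision_dim: int = 768, llm_dim: int = 4096):
            super().__init__()
            self.proj = nn.Linear(vision_dim, llm_dim)

        def forward(self, image_features: torch.Tensor) -> torch.Tensor:
            # image_features: (batch, num_patches, vision_dim) from a frozen encoder
            # returns:        (batch, num_patches, llm_dim), used as soft "image tokens"
            return self.proj(image_features)

    # Usage sketch: prepend the projected image tokens to the text token embeddings
    # and feed the concatenated sequence to the (frozen) language model.
    projector = VisionToLLMProjector()
    image_features = torch.randn(1, 257, 768)   # stand-in for image-encoder output
    text_embeds = torch.randn(1, 32, 4096)      # stand-in for LLaMA token embeddings
    inputs_embeds = torch.cat([projector(image_features), text_embeds], dim=1)

Only the projection is trained; the image encoder and the LLM stay frozen, which is why such a small bridge can be enough when the two embedding spaces are already roughly linearly related.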


Apparently you can project directly from CLIP as well; see here: https://llava-vl.github.io/. This seems pretty wild to me.


That seems pretty wild to me too.


This is the best answer. It makes sense to me. Thank you :-)


As well as this (https://llava-vl.github.io/), I just found a paper from a few months ago demonstrating the same thing: somehow language and vision models learn representations similar enough that a linear projection is sufficient. https://arxiv.org/abs/2209.15162
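The basic claim is testable with very little machinery: fit one linear map from frozen image embeddings to paired text-side embeddings and check how well it lines them up. The sketch below is only a toy version of that idea using random stand-in arrays; the real experiments in the paper use paired image/caption features and train the map into the language model's input space.

    import numpy as np

    rng = np.random.default_rng(0)
    img = rng.normal(size=(1000, 768))    # stand-in for frozen vision features (N, d_img)
    txt = rng.normal(size=(1000, 1024))   # stand-in for frozen text features   (N, d_txt)

    # Closed-form least squares: W maps image space into text space.
    W, *_ = np.linalg.lstsq(img, txt, rcond=None)

    pred = img @ W
    # Cosine similarity between mapped image features and the true text features;
    # with real paired features, a high value indicates near-linear alignment.
    cos = (pred * txt).sum(axis=1) / (np.linalg.norm(pred, axis=1) * np.linalg.norm(txt, axis=1))
    print("mean cosine similarity:", cos.mean())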


Thank you for sharing this. I would not have expected that. It does seem pretty wild.



