BLIP-2 is a contrastive image-language model. The embeddings coming out of its image side are already aligned with text and roughly linear, so it should not be a surprise that a single projection is enough to map them into LLaMA's embedding space.
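To make that concrete, here is a minimal sketch of the idea (not the actual BLIP-2 or LLaVA code; the dimensions and the 32-token visual prefix are assumptions for illustration): one learned linear layer is the only trained "glue" between the frozen image features and the LLM's input embeddings.

    import torch
    import torch.nn as nn

    IMG_DIM = 768    # assumed dim of the frozen image encoder / Q-Former output
    LLM_DIM = 4096   # assumed hidden size of the language model (e.g. LLaMA-7B)

    # The only trained module: a linear map between the two embedding spaces.
    projection = nn.Linear(IMG_DIM, LLM_DIM)

    # Stand-in for frozen image features: batch of 1, 32 visual tokens.
    image_features = torch.randn(1, 32, IMG_DIM)

    # Project into the LLM's embedding space and prepend to the text embeddings.
    visual_tokens = projection(image_features)              # (1, 32, 4096)
    text_embeds = torch.randn(1, 16, LLM_DIM)               # embedded prompt tokens
    llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)
    print(llm_inputs.shape)                                  # torch.Size([1, 48, 4096])

Everything upstream (the image encoder) and downstream (the LLM) stays frozen; only the projection is trained, which is why this works at all only if the two representation spaces are already close to linearly related.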
See also LLaVA, which does the same thing: https://llava-vl.github.io/. I also just found a paper from a few months ago that demonstrated this directly, namely that language and vision models somehow learn representations similar enough that a linear projection between them is sufficient: https://arxiv.org/abs/2209.15162