BLIP-2 is a contrastive image-language model. The embeddings coming out of its image side are already aligned with text and roughly linear, so it should not be a surprise that a single projection is enough to map them into LLaMA's embedding space.
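To make that concrete, here is a minimal sketch of the idea (not the actual BLIP-2 or LLaVA code; the dimensions and the 32-token visual prefix are assumptions for illustration): one learned linear layer is the only trained "glue" between the frozen image features and the LLM's input embeddings.

    import torch
    import torch.nn as nn

    IMG_DIM = 768    # assumed dim of the frozen image encoder / Q-Former output
    LLM_DIM = 4096   # assumed hidden size of the language model (e.g. LLaMA-7B)

    # The only trained module: a linear map between the two embedding spaces.
    projection = nn.Linear(IMG_DIM, LLM_DIM)

    # Stand-in for frozen image features: batch of 1, 32 visual tokens.
    image_features = torch.randn(1, 32, IMG_DIM)

    # Project into the LLM's embedding space and prepend to the text embeddings.
    visual_tokens = projection(image_features)              # (1, 32, 4096)
    text_embeds = torch.randn(1, 16, LLM_DIM)               # embedded prompt tokens
    llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)
    print(llm_inputs.shape)                                  # torch.Size([1, 48, 4096])

Everything upstream (the image encoder) and downstream (the LLM) stays frozen; only the projection is trained, which is why this works at all only if the two representation spaces are already close to linearly related.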
See also LLaVA, which does the same thing: https://llava-vl.github.io/. I also just found a paper from a few months ago that demonstrated this directly, namely that language and vision models somehow learn representations similar enough that a linear projection between them is sufficient: https://arxiv.org/abs/2209.15162