This thing takes an image and creates a representation matrix.
> connect it to Vicuna-13B with a linear layer
Vicuna is an open LLM - pretty good quality, though not as good as GPT-3.5.
This is the beautiful part - a mere multiplication is enough to convert the image tensor to a text tensor. One freaking line of code, and a simple one.
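A minimal sketch of that multiplication (the dimensions here are made up for illustration; the real model projects the vision encoder's output features into Vicuna's embedding space):

```python
import numpy as np

# Hypothetical shapes: suppose the vision side emits 32 feature vectors of
# size 768, and the LLM expects input embeddings of size 5120.
n_tokens, d_vision, d_llm = 32, 768, 5120

image_features = np.random.randn(n_tokens, d_vision)  # frozen vision encoder output
W = np.random.randn(d_vision, d_llm) * 0.02           # the linear projection layer

# The "one line": map visual features into the LLM's embedding space.
llm_inputs = image_features @ W
```

Everything upstream (the encoder) and downstream (the LLM) stays untouched; `W` is the whole bridge.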
> and train just the tiny layer on some datasets of image-text pairs
You then get a shitload of image-text pairs and train the model to describe the images in text, while keeping both the image and text models frozen. Is that hard? No, you just flip a flag. So this "linear projection layer" (a matrix multiplication) is the only learned part. That means training takes less time, needs fewer examples, and requires less memory.
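To see why freezing the big models makes this cheap, here's a toy stand-in with invented shapes and a made-up regression objective (the real objective is the LLM's next-token loss): the frozen models act as fixed feature extractors, and the only parameter that ever receives a gradient update is the projection matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
d_vision, d_llm, n_pairs = 8, 16, 200

# Stand-in for the frozen vision encoder: a fixed function producing features.
image_features = rng.normal(size=(n_pairs, d_vision))

# Toy setup: pretend the data implies some "right" projection plus noise.
W_true = rng.normal(size=(d_vision, d_llm))
target_embeddings = image_features @ W_true + 0.1 * rng.normal(size=(n_pairs, d_llm))

# Freezing means the ONLY thing we update is W - a tiny parameter count.
W = np.zeros((d_vision, d_llm))
lr = 0.05
for _ in range(300):
    pred = image_features @ W          # forward pass through the glue layer
    grad = image_features.T @ (pred - target_embeddings) / n_pairs
    W -= lr * grad                     # gradient step on W alone

final_loss = np.mean((image_features @ W - target_embeddings) ** 2)
```

With only `d_vision * d_llm` trainable numbers, each step is cheap and the optimizer needs no memory for the frozen models' gradients - the same reason the full system trains quickly.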
Training the image and text models themselves was much more difficult. But here they aren't trained at all - they're used as ready-made parts. It's a hack gluing together two unrelated models, so it's cheap.
Then the finishing touch: they curate 3,500 high-quality image-text pairs and fine-tune on them. Now the model becomes truly amazing. It has broad visual intelligence, and it scooped OpenAI, who hadn't released GPT-4's image input in the API yet.
The important lesson: unrelated models can be composed together with a bit of extra training for the glue layer. And open AI is sometimes just as powerful as "Open"AI - breathing down their necks, just one step behind. This model also matters for applications: it can power many automations in a flexible way.
> This is the beautiful part - a mere multiplication is enough to convert the image tensor to a text tensor. One freaking line of code, and a simple one.
I thought they were creating image tokens based on the queries during finetuning and appending them to the language model's input sequence. Those are not text tokens.
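Right - in this style of glue model, the projected image features are spliced in at the embedding layer, never through the tokenizer, so no text token ever corresponds to them. A sketch with invented shapes:

```python
import numpy as np

d_llm = 5120                               # e.g. Vicuna-13B's hidden size
img_embeds = np.random.randn(32, d_llm)    # projected image features ("image tokens")
txt_embeds = np.random.randn(10, d_llm)    # embeddings of the actual text prompt

# The image embeddings are concatenated directly in embedding space --
# they skip the tokenizer entirely, so they are not text tokens.
sequence = np.concatenate([img_embeds, txt_embeds], axis=0)
```

The LLM then attends over the combined sequence as if the image embeddings were ordinary (soft) prompt positions.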