This thing takes an image and creates a representation matrix.
> connect it to Vicuna-13B with a linear layer
Vicuna is an open LLM - pretty good quality, though not as good as GPT-3.5.
This is the beautiful part - a mere multiplication is enough to convert the image tensor to a text tensor. One freaking line of code, and a simple one.
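A minimal sketch of that multiplication (the dimensions here are made up for illustration; the real model projects the vision encoder's output features into Vicuna's embedding space):

```python
import numpy as np

# Hypothetical shapes: suppose the vision side emits 32 feature vectors of
# size 768, and the LLM expects input embeddings of size 5120.
n_tokens, d_vision, d_llm = 32, 768, 5120

image_features = np.random.randn(n_tokens, d_vision)  # frozen vision encoder output
W = np.random.randn(d_vision, d_llm) * 0.02           # the linear projection layer

# The "one line": map visual features into the LLM's embedding space.
llm_inputs = image_features @ W
```

Everything upstream (the encoder) and downstream (the LLM) stays untouched; `W` is the whole bridge.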
> and train just the tiny layer on some datasets of image-text pairs
You then get a shitload of image-text pairs and train the model to describe the images in text, while keeping both the image and text models frozen. Is that hard? No, you just flip a flag. So this "linear projection layer" (a matrix multiplication) is the only learned part. That means training takes less time, needs fewer examples, and requires less memory.
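To see why freezing the big models makes this cheap, here's a toy stand-in with invented shapes and a made-up regression objective (the real objective is the LLM's next-token loss): the frozen models act as fixed feature extractors, and the only parameter that ever receives a gradient update is the projection matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
d_vision, d_llm, n_pairs = 8, 16, 200

# Stand-in for the frozen vision encoder: a fixed function producing features.
image_features = rng.normal(size=(n_pairs, d_vision))

# Toy setup: pretend the data implies some "right" projection plus noise.
W_true = rng.normal(size=(d_vision, d_llm))
target_embeddings = image_features @ W_true + 0.1 * rng.normal(size=(n_pairs, d_llm))

# Freezing means the ONLY thing we update is W - a tiny parameter count.
W = np.zeros((d_vision, d_llm))
lr = 0.05
for _ in range(300):
    pred = image_features @ W          # forward pass through the glue layer
    grad = image_features.T @ (pred - target_embeddings) / n_pairs
    W -= lr * grad                     # gradient step on W alone

final_loss = np.mean((image_features @ W - target_embeddings) ** 2)
```

With only `d_vision * d_llm` trainable numbers, each step is cheap and the optimizer needs no memory for the frozen models' gradients - the same reason the full system trains quickly.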
Training the image and text models themselves was much more difficult. But here they aren't trained at all - they're used as ready-made parts. It's a hack gluing together two unrelated models, so it's cheap.
Then the finishing touch: they curate 3,500 high-quality image-text pairs and fine-tune on them. Now the model becomes truly amazing. It has broad visual intelligence, and it scooped OpenAI, who hadn't released GPT-4's image input in the API yet.
The important lesson: unrelated models can be composed together with a bit of extra training for the glue layer. And open AI is sometimes just as powerful as "Open"AI - breathing down their necks, just one step behind. This model also matters for applications: it can power many automations in a flexible way.
> This is the beautiful part - a mere multiplication is enough to convert the image tensor to a text tensor. One freaking line of code, and a simple one.
I thought they were creating image tokens based on the queries during finetuning and appending them to the language model's input sequence. Those are not text tokens.
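Right - in this style of glue model, the projected image features are spliced in at the embedding layer, never through the tokenizer, so no text token ever corresponds to them. A sketch with invented shapes:

```python
import numpy as np

d_llm = 5120                               # e.g. Vicuna-13B's hidden size
img_embeds = np.random.randn(32, d_llm)    # projected image features ("image tokens")
txt_embeds = np.random.randn(10, d_llm)    # embeddings of the actual text prompt

# The image embeddings are concatenated directly in embedding space --
# they skip the tokenizer entirely, so they are not text tokens.
sequence = np.concatenate([img_embeds, txt_embeds], axis=0)
```

The LLM then attends over the combined sequence as if the image embeddings were ordinary (soft) prompt positions.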