CLIP doesn't do captioning; it just produces embeddings. And because it's trained contrastively, it would work poorly for this kind of task: anything 'relational' falls apart immediately. (See, for example, the DALL-E 2 results on these kinds of captions/tasks.)
It's almost certainly a VQ-VAE-style encoding of the image itself into a sequence of tokens, as was done by DALL-E 1, CM3, Gato and a whole bunch of more recent models. It's the very obvious thing to do, and their context window is more than large enough now.
This way the model would also be able to generate images. I'd also be curious how they handle images with different aspect ratios (and maybe higher resolutions, so it can read papers well).
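For anyone unfamiliar with the idea, here's a rough sketch of the tokenization step (PyTorch, with illustrative sizes roughly matching DALL-E 1's 32x32 grid of codes over a 256x256 image; the encoder architecture and codebook size are made up for the example, not anything that's been published about this model):

```python
import torch
import torch.nn as nn

class ImageTokenizer(nn.Module):
    """VQ-VAE-style encoder: image -> grid of discrete codebook indices."""

    def __init__(self, codebook_size=8192, latent_dim=256):
        super().__init__()
        # Downsample 256x256x3 to a 32x32 grid of latent vectors (factor of 8).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, latent_dim, 4, stride=2, padding=1),
        )
        self.codebook = nn.Embedding(codebook_size, latent_dim)

    def forward(self, images):                         # images: (B, 3, 256, 256)
        z = self.encoder(images)                       # (B, D, 32, 32)
        b, d, h, w = z.shape
        z = z.permute(0, 2, 3, 1).reshape(-1, d)       # (B*1024, D)
        # Snap each latent to its nearest codebook vector (training losses
        # and the straight-through estimator are omitted for brevity).
        dists = torch.cdist(z, self.codebook.weight)   # (B*1024, K)
        indices = dists.argmin(dim=-1)                 # nearest code per patch
        return indices.reshape(b, h * w)               # (B, 1024) image tokens

# Each image becomes ~1024 discrete tokens that can be interleaved with
# text tokens in the same transformer context (and, with a decoder on top,
# sampled back out to generate images).
tokens = ImageTokenizer()(torch.randn(1, 3, 256, 256))
print(tokens.shape)  # torch.Size([1, 1024])
```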