
I bet they use CLIP to caption the image and feed the text of the caption into GPT, but that's just a guess.
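Something like the sketch below, i.e. caption first, then prompt the model with the caption. To be clear, this is just my guess at the pipeline, not anything OpenAI has described; the captioning model (BLIP here, just because it's easy to run off the shelf), the prompt wording, and the API call are all placeholders.

    # Hypothetical caption-then-prompt pipeline (just a guess, see replies).
    # Model names and the prompt are illustrative, not OpenAI's actual setup.
    from PIL import Image
    from transformers import BlipProcessor, BlipForConditionalGeneration
    import openai

    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

    image = Image.open("extreme_ironing.jpg").convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    caption_ids = captioner.generate(**inputs, max_new_tokens=30)
    caption = processor.decode(caption_ids[0], skip_special_tokens=True)

    # Feed the caption text, not the pixels, to the language model.
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": f"The image shows: {caption}. What is unusual about it?"}],
    )
    print(response.choices[0].message.content)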



Did you check all of the samples provided? It can read an entire research paper and understand the figures just from the images of the paper's pages. This seems to be a much deeper connection than extracting captions.


Are you sure? Sounds too epic


See the real examples for yourself, starting on page 34 ... mind-blowing.

https://cdn.openai.com/papers/gpt-4.pdf


The extreme ironing image example has a bullshit explanation in the paper. That extreme-ironing-on-the-back-of-a-taxi shot is a popular photo with lots of text already associated with it: https://google.com/search?q=extreme+ironing+taxi&tbm=isch

Give the model new images that are not in the training set (e.g. photos not on the internet, or photos taken after the model was trained), ask the same question, and see how well it does!

The paper says: “Table 16. [snip] The prompt requires image understanding.”

I think the explanations OpenAI gives for the images in the paper are probably misinformation or misdirection. I would guess it is recognising the images from its training data and associating them with nearby text.


It seems like they used some unknown images in the livestream, see replies to: https://news.ycombinator.com/item?id=35157940

However, I still think they should not have used images from the internet/training set in their paper. And to be safe, neither should they use “generated” images.

I am looking forward to taking photos of some paintings by friends and seeing if ChatGPT can describe them!


It's SOTA on DocVQA [1], so yeah, it is able to read text/graphs/tables from images.

[1] https://www.docvqa.org/


CLIP doesn't do captioning, it just generates embeddings. And it's contrastive, so it would work poorly for this kind of task: anything 'relational' falls apart immediately. (See for example the DALL-E 2 results for these kinds of captions/tasks.)
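For reference, this is roughly all CLIP gives you out of the box: similarity scores between an image and candidate texts, not a caption. A minimal sketch with the Hugging Face wrapper; the checkpoint and the candidate labels are just examples:

    # CLIP is contrastive: it scores image/text pairs, it doesn't generate captions.
    from PIL import Image
    import torch
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("photo.jpg").convert("RGB")
    texts = [
        "a man ironing on the back of a taxi",
        "a taxi on the back of a man ironing",  # relational flips like this are where it struggles
        "a dog playing in a park",
    ]

    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_image: similarity-derived score of the image against each text.
    probs = outputs.logits_per_image.softmax(dim=-1)
    for text, p in zip(texts, probs[0]):
        print(f"{p:.3f}  {text}")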

It's almost certainly a VQ-VAE-style encoding of the image itself into a sequence of tokens, as was done by DALL-E 1, CM3, Gato and a whole bunch of more recent models. It's the very obvious thing to do, and their context window is more than large enough now.
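A toy sketch of that interface is below. The encoder, codebook size, grid size, and shared-vocabulary offset are all made up for illustration (DALL-E 1 trained a dVAE with 8192 codes over a 32x32 grid; nobody outside OpenAI knows what GPT-4 actually uses):

    # Toy VQ-style image tokenizer: image -> grid of discrete codebook indices,
    # which get appended to the text token sequence. All sizes are illustrative.
    import torch
    import torch.nn as nn

    TEXT_VOCAB = 50_000      # assumed text vocabulary size
    CODEBOOK_SIZE = 8_192    # number of discrete image codes (DALL-E 1 used 8192)
    EMBED_DIM = 64

    class ToyImageTokenizer(nn.Module):
        def __init__(self):
            super().__init__()
            # Downsample a 256x256 RGB image to a 32x32 grid of feature vectors.
            self.encoder = nn.Conv2d(3, EMBED_DIM, kernel_size=8, stride=8)
            self.codebook = nn.Embedding(CODEBOOK_SIZE, EMBED_DIM)

        def forward(self, image: torch.Tensor) -> torch.Tensor:
            feats = self.encoder(image)                # (B, D, 32, 32)
            feats = feats.flatten(2).transpose(1, 2)   # (B, 1024, D)
            # Nearest codebook entry per grid cell = that cell's discrete "image token".
            f2 = feats.pow(2).sum(-1, keepdim=True)         # (B, 1024, 1)
            c2 = self.codebook.weight.pow(2).sum(-1)        # (CODEBOOK_SIZE,)
            dots = feats @ self.codebook.weight.t()         # (B, 1024, CODEBOOK_SIZE)
            return (f2 - 2 * dots + c2).argmin(dim=-1)      # (B, 1024) int indices

    tokenizer = ToyImageTokenizer()
    image = torch.rand(1, 3, 256, 256)
    image_tokens = tokenizer(image) + TEXT_VOCAB   # offset into a shared vocabulary
    text_tokens = torch.tensor([[101, 2023, 2003, 1037, 3231]])  # pretend BPE ids

    # The transformer just sees one long token sequence: [text tokens][image tokens].
    sequence = torch.cat([text_tokens, image_tokens], dim=1)
    print(sequence.shape)   # torch.Size([1, 1029])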


This way the model would also be able to generate images. I'm also curious how they handle images with different aspect ratios (and maybe resolution, so it can read papers well).


There's no need to round-trip through text; you "just" need to train an embedding space that captures both domains.


You can look at Google's recent PaLM-E model for a possible approach. They use a vision transformer to tokenise the image (or to generate embeddings and then tokenise those?) and they also tokenise detected objects so the model can reason at a semantic level. Either way, it's been shown that these massive LLMs can handle images in tokenised form if you pretend it's text. In Google's case, the model is trained to look for sentinel values in the prompt (i.e. <img>) that denote images/objects are being sent.
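Roughly, the splice would look like the sketch below. This is a toy, not Google's code: the dimensions, the <img> sentinel id, and the single linear projection are stand-ins for whatever PaLM-E actually does.

    # Toy sketch: project ViT patch embeddings into the LLM's embedding space and
    # splice them in at the <img> sentinel position. All numbers are assumptions.
    import torch
    import torch.nn as nn

    LLM_DIM = 1024         # hidden size of the (pretend) language model
    VIT_DIM = 768          # hidden size of the (pretend) vision transformer
    IMG_SENTINEL = 32_000  # token id reserved for "<img>" in the text tokenizer

    text_embedding = nn.Embedding(32_001, LLM_DIM)
    project = nn.Linear(VIT_DIM, LLM_DIM)   # learned projection: ViT space -> LLM space

    def build_inputs(token_ids: torch.Tensor, vit_patches: torch.Tensor) -> torch.Tensor:
        """token_ids: (seq,) containing one IMG_SENTINEL; vit_patches: (n_patches, VIT_DIM)."""
        embeds = text_embedding(token_ids)           # (seq, LLM_DIM)
        img_embeds = project(vit_patches)            # (n_patches, LLM_DIM)
        pos = int((token_ids == IMG_SENTINEL).nonzero()[0, 0])
        # Replace the single sentinel embedding with the run of image embeddings.
        return torch.cat([embeds[:pos], img_embeds, embeds[pos + 1:]], dim=0)

    tokens = torch.tensor([5, 17, IMG_SENTINEL, 42, 7])   # "... <img> ..."
    patches = torch.randn(256, VIT_DIM)                   # pretend ViT patch outputs
    inputs_embeds = build_inputs(tokens, patches)
    print(inputs_embeds.shape)   # torch.Size([260, 1024]) -> fed to the LLM as embeddings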


They almost certainly generate tokens directly from the image. It would be extremely hard to generate short English descriptions that describe the images well enough to pass some of those benchmarks.



