Wouldn't that be more applicable to image generation, or at least wanting to enc... | Hacker News


HarHarVeryFunny on June 7, 2024 | on: How Does GPT-4o Encode Images?

Wouldn't that be more applicable to image generation, or at least wanting to encode the image as a whole?

If you need to be able to reason about multiple objects in the image and their relative positions, then don't you need to use a tiled approach?

rafaelero on June 7, 2024

A VQ-VAE is trained to reconstruct the image, so in theory its embeddings should contain all the information (both content and location).
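To make the "content and location" point concrete, here is a minimal sketch of the quantization step only, not a full VQ-VAE and not a claim about how GPT-4o works. It assumes the encoder emits a grid of latent vectors and each one is snapped to its nearest codebook entry. The `codebook`, `latents`, and function names are hypothetical toy values for illustration.

```python
def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared Euclidean distance)."""
    dists = [sum((v - c) ** 2 for v, c in zip(vec, code)) for code in codebook]
    return dists.index(min(dists))


def quantize(latents, codebook):
    """Map an H x W grid of latent vectors to an H x W grid of code indices."""
    return [[nearest_code(vec, codebook) for vec in row] for row in latents]


# Hypothetical 4-entry codebook of 2-D vectors (real codebooks are learned and far larger).
codebook = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]

# Toy 2 x 2 grid of encoder outputs, each lying near one codebook entry.
latents = [
    [(0.1, 0.1), (0.1, 1.1)],
    [(1.1, 0.1), (1.1, 1.1)],
]

indices = quantize(latents, codebook)

# Mirroring the latent grid left-to-right mirrors the index grid as well.
mirrored = quantize([row[::-1] for row in latents], codebook)

print(indices)
print(mirrored)
```

The code index carries the "what" and its position in the grid carries the "where", so a model reading the index grid in order still sees the spatial layout without a separate tiling step. Whether that is enough for fine-grained spatial reasoning in practice is a separate question.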
