
Wouldn't that be more applicable to image generation, or at least to cases where you want to encode the image as a whole?

If you need to be able to reason about multiple objects in the image and their relative positions, then don't you need to use a tiled approach?




A VQ-VAE is trained to reconstruct the image, so in theory its embeddings should contain all the information (both content and location).
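
To make the location point concrete, here is a minimal toy sketch (not any particular model's code) of the quantization step. It assumes the encoder outputs an H x W grid of feature vectors and that each one is snapped to its nearest codebook entry. The codebook values, the feature values, and the names `nearest_code` and `quantize` are all made up for illustration.

```python
def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(vec, codebook[k]))


def quantize(feature_map, codebook):
    """Map an H x W grid of feature vectors to an H x W grid of code indices."""
    return [[nearest_code(vec, codebook) for vec in row] for row in feature_map]


# Hypothetical 2-entry codebook: code 0 ~ "background", code 1 ~ "object".
CODEBOOK = [[0.0, 0.0], [1.0, 1.0]]

# Toy 2 x 3 encoder output with an "object" feature in the top-right cell.
features = [
    [[0.1, -0.1], [0.0, 0.2], [0.9, 1.1]],
    [[0.0, 0.0], [0.1, 0.1], [-0.2, 0.1]],
]

# Same content, but the "object" feature moved to the bottom-left cell.
shifted = [
    [[0.1, -0.1], [0.0, 0.2], [-0.2, 0.1]],
    [[0.9, 1.1], [0.1, 0.1], [0.0, 0.0]],
]

codes = quantize(features, CODEBOOK)
shifted_codes = quantize(shifted, CODEBOOK)
```

The two index grids use the same codes but in different cells, so where something sits is carried by the position of its code in the grid, not only by which code it is. Whether a downstream model actually makes use of that layout is a separate question.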



