
> particularly interested in training a hypernetwork to translate natural language instructions into latent-space navigation instructions with the end goal of enabling me to give the model natural-language feedback on its generations.

What are you doing exactly?



Imagine every conceivable image laid out on the ground, with similar images closer together. You’re looking at an image of a face. Some nearby images are happier, some sadder, some with different hair or eye colours, every possibility in every combination all around it. There are a lot of images, so even if what you want is nearby, it’s hard to know where to look. They’re going to write software that points you in the right direction when you describe what you want in text.

Here’s an example of this sort of manipulation: https://arxiv.org/abs/2102.01187
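
To make the “walking around image space” metaphor concrete, here is a minimal runnable sketch, not the parent poster’s actual setup: a real version would use a pretrained generator like StyleGAN and a learned direction, but a tiny stand-in network and a random unit vector keep it self-contained.

  import torch
  import torch.nn as nn

  # Stand-in "generator": a real setup would be a pretrained GAN mapping
  # a latent vector z to an image; a tiny linear layer keeps this runnable.
  latent_dim = 512
  G = nn.Sequential(nn.Linear(latent_dim, 3 * 64 * 64), nn.Tanh())

  # A "direction" in latent space, e.g. one correlated with smiling.
  # Here it is just a random unit vector for illustration.
  smile_direction = torch.randn(latent_dim)
  smile_direction = smile_direction / smile_direction.norm()

  z = torch.randn(1, latent_dim)       # a random point = some face
  base_image = G(z).view(3, 64, 64)    # the image you're "standing at"

  # "Walking" toward happier faces = moving the latent code along the
  # direction and re-generating.
  for strength in (0.5, 1.0, 2.0):
      edited = G(z + strength * smile_direction).view(3, 64, 64)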


AFAICT: making a navigation/direction model that can translate phrase-based directions into actual map-based directions, with the caveat that the model would be updated primarily by giving it feedback the same way that you would give a person feedback.

Sounds only a couple of steps removed from basically needing AGI?


I suspect you’d want to start by trying to translate differences between images into descriptive differences. Maybe you could generate training pairs by symbolic manipulation of images, or maybe NLP can find the differences between pairs of captions? Large NLP models already feel pretty magical to me and encompass things we would have said required AGI until recently, so it seems possible, though really tough.
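
Roughly what I mean by the caption-difference route, as a toy sketch. Everything here is a stand-in: real text embeddings would come from something like CLIP or a sentence encoder, and real latent pairs from a generator plus known edits; the point is just the shape of the training signal, i.e. learn to map “difference in descriptions” to “difference in latent codes”.

  import torch
  import torch.nn as nn

  text_dim, latent_dim = 256, 512

  # Toy paired data: (t_a, t_b) are caption embeddings and (z_a, z_b) the
  # corresponding latent codes, where b is a described edit of a.
  # Random tensors here; real pairs in practice.
  n = 1024
  t_a, t_b = torch.randn(n, text_dim), torch.randn(n, text_dim)
  z_a, z_b = torch.randn(n, latent_dim), torch.randn(n, latent_dim)

  # Mapper: caption difference -> latent-space direction.
  mapper = nn.Sequential(
      nn.Linear(text_dim, 512), nn.ReLU(), nn.Linear(512, latent_dim)
  )
  opt = torch.optim.Adam(mapper.parameters(), lr=1e-3)

  for _ in range(100):
      pred_direction = mapper(t_b - t_a)          # predicted latent move
      loss = ((pred_direction - (z_b - z_a)) ** 2).mean()
      opt.zero_grad()
      loss.backward()
      opt.step()

  # At use time: embed "what I have" and "what I asked for", feed the
  # difference through the mapper, and add the output to the current z.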



