
It's really nice, but I don't understand why they keep pushing the idea of text-to-image. Text is not a great medium for describing visual scenes; no one in the real world who's working on real content authoring actually uses textual descriptions.

Why not allow for more Photoshop-style, freehand-art (or 3D editor) controls, which are much simpler to parse than textual descriptions?



Accessibility and training data.

Nvidia Canvas existed before text-to-image models, but it didn't gain as much popularity with the masses.

The other part is the training data: there are masses of (text description, image) pairs, whilst if you want to do something more novel you may struggle to find a big enough dataset.
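To make the asymmetry concrete, here's a toy sketch of how easily web-scraped (caption, image) pairs turn into a training set. The manifest format and file paths are made up for illustration:

    # Toy sketch: (text description, image) pairs from a scraped manifest.
    # The manifest format and paths are hypothetical.
    import json
    from PIL import Image
    from torch.utils.data import Dataset

    class CaptionImageDataset(Dataset):
        def __init__(self, manifest_path):
            # One JSON object per line, e.g.
            # {"image": "imgs/0001.jpg", "caption": "a pink elephant in a lobby"}
            with open(manifest_path) as f:
                self.records = [json.loads(line) for line in f]

        def __len__(self):
            return len(self.records)

        def __getitem__(self, idx):
            rec = self.records[idx]
            return Image.open(rec["image"]).convert("RGB"), rec["caption"]

There's no equivalent firehose of (brush stroke, image) or (3D scene, image) pairs lying around on the web.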


Image/video generation could possibly be used to advance LLMs in quite a substantial way:

If the LLM, during its "thinking" phase, encountered a scenario where it had to imagine a particular scene (say, a pink elephant in a hotel lobby), it could internally generate that image and use it to aid in world simulation / understanding.

This is what happens in my head at least!
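If the image model were exposed to the LLM as a tool, the loop might look something like this. Everything here (llm_step, generate_image, the step fields) is a hypothetical placeholder, not any real API:

    # Speculative sketch: an LLM "imagining" a scene mid-reasoning by
    # calling an image model as a tool. All names are hypothetical.
    def reason_with_imagination(llm_step, generate_image, question):
        context = [question]
        while True:
            step = llm_step(context)  # produce the next chunk of reasoning
            if step.wants_image:
                # e.g. "imagine: a pink elephant in a hotel lobby"
                image = generate_image(step.image_prompt)
                # Feed the rendered scene back in so later steps can
                # reason over the model's own "mental image".
                context.append(image)
            context.append(step.text)
            if step.is_final:
                return step.text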


These things are not mutually exclusive.

All of this already exists in various forms: inpainting lets you make changes by masking over sections of an image, ControlNets let you guide the generation of an image through many different forms ranging from depth maps to posable figures, etc.
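For instance, here's roughly what inpainting looks like with the Hugging Face diffusers library; the model id, file names, and prompt are just placeholders:

    # Minimal inpainting sketch with diffusers: only the white region of
    # the mask is regenerated, the rest of the image is preserved.
    import torch
    from PIL import Image
    from diffusers import StableDiffusionInpaintPipeline

    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting",
        torch_dtype=torch.float16,
    ).to("cuda")

    init_image = Image.open("lobby.png").convert("RGB")  # original image
    mask_image = Image.open("mask.png").convert("L")     # white = repaint

    result = pipe(
        prompt="a pink elephant standing in a hotel lobby",
        image=init_image,
        mask_image=mask_image,
    ).images[0]
    result.save("lobby_edited.png")

Text still steers what goes into the masked region, but where it goes is specified spatially, not verbally.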


> no one in the real world who's working on real content authoring actually uses textual descriptions

As someone who owns an AI image SaaS making over $100k per month, this made me chuckle.


Dang, you are so cool and so smart!


Sorry, what do you mean? I didn't say that.




