It's really nice, but I don't understand why they keep pushing the idea of text-to-image - text is not a great medium for describing visual scenes, and hardly anyone in the real world doing real content authoring works from textual descriptions.
Why not allow for more Photoshop, freehand-art (or 3D-editor) style controls, which are much simpler to parse than textual descriptions?
Nvidia Canvas existed before text-to-image models, but it didn't gain as much popularity with the masses.
The other part is training data - there are masses of (text description, image) pairs, whereas if you want to do something more novel you may struggle to find a big enough dataset.
Image/video generation could possibly be used to advance LLMs in quite a substantial way:
If the LLM, during its "thinking" phase, encountered a scenario where it had to imagine a particular scene (say, a pink elephant in a hotel lobby), it could internally generate that image and use it to aid world simulation and understanding.
All of this already exists in various forms: inpainting lets you make changes by masking over sections of an image, ControlNets let you guide the generation of an image through many different inputs ranging from depth maps to posable figures, etc.
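For example, here's roughly what the ControlNet path looks like with Hugging Face's diffusers library - a minimal sketch, not a recipe: the checkpoint IDs are real models on the Hub, but the depth-map file is just a placeholder you'd normally get from a depth estimator or export from a 3D editor.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Depth-conditioned ControlNet: the depth map fixes the scene layout,
# while the text prompt only fills in content and style.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Placeholder conditioning image (hypothetical file): a grayscale depth map
# of the lobby, e.g. rendered from a rough 3D blockout.
depth_map = load_image("hotel_lobby_depth.png")

image = pipe(
    "a pink elephant standing in a hotel lobby",
    image=depth_map,
    num_inference_steps=30,
).images[0]
image.save("elephant_lobby.png")
```

So the "Photoshop/3D-editor style" controls the parent comment asks for mostly reduce to swapping in a different conditioning image (scribbles, poses, segmentation maps) instead of the depth map above.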