Text prompts aren't an essential part of this technology. They're used as the interface to generation APIs because they're easy to build and easy to moderate, and, for Discord-based models like Midjourney, they make it easy for people to copy your work.

With a local model you can find latent space coordinates any way you want, and you can patch the pixel generation model any way you want too. (These two techniques are usually called textual inversion and LoRAs, respectively.)
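
To make those two ideas concrete, here is a toy sketch in NumPy. It is not how any real image model is wired up: the single linear layer `W`, the sizes, `alpha`, and the `target` vector are all made-up stand-ins. It only shows the shape of each trick: a LoRA-style patch adds a small low-rank update on top of frozen weights, and a textual-inversion-style search leaves the model alone and optimizes an input vector instead.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 8, 16, 2

# Frozen stand-in for one pretrained layer (hypothetical sizes).
W = rng.normal(size=(d_out, d_in))

# --- LoRA-style patch: train two small matrices, leave W untouched ---
A = rng.normal(scale=0.01, size=(rank, d_in))
B = np.zeros((d_out, rank))  # zero init: the patch starts as a no-op
alpha = 4.0                  # assumed scaling factor

def patched_forward(x):
    """Frozen weight plus the low-rank update (alpha / rank) * B @ A."""
    return x @ (W + (alpha / rank) * (B @ A)).T

x = rng.normal(size=(3, d_in))
patch_is_noop_at_init = np.allclose(patched_forward(x), x @ W.T)

# --- Textual-inversion-style search: optimize an input vector, not the model ---
target = rng.normal(size=d_out)               # stand-in for "the output we want"
embedding = np.zeros(d_in)                    # the only thing being trained
lr = 1.0 / (2.0 * np.linalg.norm(W, 2) ** 2)  # safe step size for this quadratic loss

def loss(e):
    return float(np.sum((W @ e - target) ** 2))

initial_loss = loss(embedding)
for _ in range(500):
    grad = 2.0 * W.T @ (W @ embedding - target)  # gradient of the squared error
    embedding -= lr * grad
final_loss = loss(embedding)

print(f"LoRA patch parameters: {A.size + B.size} vs. full layer: {W.size}")
print(f"embedding loss: {initial_loss:.3f} -> {final_loss:.3f}")
```

In both halves the original weights never change, which is why these tricks are cheap to train and easy to share as small files next to a base model.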

I would personally like to see a system that can input and output layers instead of a single combined image.