In the case of DALL-E, isn't a hack for that to design _another_ ML system that looks at the training data 'descriptions' and infers a correct prompt structure for the desired result? So the human uses this secondary system as a 'guide' to constructing prompts, navigating them toward the result (image) they want?
I think in some cases you're right. But generally I think the real answer is more along the lines of near-real-time generation of results. That way you can iterate quickly to get closer to what you're looking for, and experiment with different approaches to see which one is more aligned with how the AI model thinks about things.