There are multiple open-source GPTs, but GPT-3 is absolutely massive - larger than the image models, actually! So, unfortunately, text generation is probably even more complex and resource-intensive than image generation (especially to train). Additionally, in image generation we appreciate creative solutions, but in text generation the creative solutions seem like utter nonsense.
My intuition was based on the file size of text being so much smaller than that of images, but I guess that doesn't really map to the complexity of generating it. Fascinating!
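A back-of-the-envelope sketch of why file size is misleading here (the sample string and image dimensions are just illustrative placeholders):

    # A paragraph of text is tiny on disk...
    paragraph = "GPT-3 is a large autoregressive language model. " * 10
    print(f"text: {len(paragraph.encode('utf-8'))} bytes")  # 480 bytes

    # ...while even a small generated image is orders of magnitude larger.
    width, height, channels = 512, 512, 3  # a typical SD output
    print(f"raw image: {width * height * channels} bytes")  # 786,432 bytes

    # But storage size says little about generation difficulty: each token
    # choice depends on long-range meaning, while many pixels are locally
    # predictable from their neighbours.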
I think large language models are still in their infancy. The models are extremely sparse, but we don't have the tooling yet to deal with these kinds of structures efficiently. Your intuition might be right in the future, maybe.
If you think about the space both models are covering from a rate-of-failure perspective, it kind of makes sense that images end up being a bit easier than text: both text and image models can output results that look plausible at first glance, but when you analyze the outputs further, there are a lot more gotchas in parsing meaning within language than there are in pixel placement within an image.
Actually, I'd argue that images generated by SD have far more flaws than text produced by GPT-3. GPT-3 (at least the full model) is quite capable of writing stuff that "makes sense", but most of the eye-candy results generated by SD are cherry-picked; the rest are simply crap.
That's kinda the point here, I think. GPT-3 is trained on much more data than SD and contains much more knowledge. SD is actually similar in size to GPT-2.
An image model the same size as GPT-3 should be much more impressive; the difference will probably be as large as the one between GPT-2 and GPT-3.
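For scale, here are the approximate published parameter counts (fp16 footprint is just params × 2 bytes, ignoring activations and optimizer state; exact SD component sizes vary a bit by version):

    # Approximate published parameter counts.
    params = {
        "GPT-2 XL": 1.5e9,
        "Stable Diffusion v1 (UNet + text encoder + VAE)": 1.1e9,
        "GPT-3 (davinci)": 175e9,
    }
    for name, n in params.items():
        # fp16 weights take 2 bytes per parameter.
        print(f"{name}: {n / 1e9:.1f}B params, ~{n * 2 / 1e9:.0f} GB in fp16")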
Ask SD to write a step-by-step guide to do something and it will create an image that looks kinda like some instructions, but the contents will be nonsense.
An image model the size of GPT-3 could probably do this task quite well in many cases.
Image models also need much better language understanding to get to the next level, though, so multimodal models probably make more sense. Maybe feeding web pages rendered as images to an image model could give interesting results; the rendering step, at least, is easy to sketch.
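A minimal sketch of that rendering step, assuming Playwright for the browser automation (the URL, output path, and viewport size are placeholders; the actual dataset and training pipeline are out of scope):

    # Render a web page to a fixed-size image, e.g. to build a
    # (screenshot, text) corpus for a multimodal model.
    from playwright.sync_api import sync_playwright

    def render_page(url: str, out_path: str) -> None:
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page(viewport={"width": 1024, "height": 1024})
            page.goto(url)
            page.screenshot(path=out_path)  # captures the viewport
            browser.close()

    render_page("https://example.com", "page.png")  # placeholder URL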