I actually agree with you. I was a bit sarcastic. If I understand correctly there isn't a fundamental difference when it comes to text output vs pixel data output in this context. If so then it suddenly sounds much more of a stretch (intuitively) to claim that somehow stable diffusion understands the real world (like people claim to be the case with language models).