Sure, but this model is a general image synthesizer trained on a massive amount of data that is openly available on the internet. Given that, I would assume it has seen many images of 74xx series chips and of musical notation. So I would expect the most "likely" image to generate to be either a chip with 8 pins on each side (for the 74LS173) or a note on a single leger line just below a staff with a treble clef. I imagine there are hundreds of 74xx chip images that would establish the first fact and thousands of images of musical notation that would establish the second.

I guess the takeaway really is that the model does not work in a way that lets it recall its training data, which is fine. I don't think I should expect it to. On the other hand, GPT-3 can be made to produce specific facts that are established by its training data, although admittedly it often gets things wrong. Maybe image modeling is just naturally harder than language modeling. After all, language represents meaning much more "directly", in some sense, than arbitrary images do.

I'm sure someone could design a targeted model that would solve the issues I'm talking about. But I feel they shouldn't have to if we really had something that sees the world the way humans do. In any case, this work definitely seems cutting-edge and represents a huge leap in that direction.