Thanks. Your comment about BLIP2 already projecting RGB inputs into (a different) text token space makes sense to me. See also fpgaminer's comment at https://news.ycombinator.com/item?id=35603246 . However, I don't see how the universal approximation theorem is relevant here. The fact that deep models with sufficient capacity can approximate any function does not imply that two deep models, trained independently of each other on different tasks, will learn to approximate functions that are related to each other by nothing more than a linear transformation.
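FWIW, whether two embedding spaces are related by a linear map is something you can just test. Here's a toy sketch (the two "encoders" are made-up stand-ins, not BLIP2's actual models): fit the best least-squares linear map from one embedding space to the other and look at the residual.

    # Toy check of the "related by a linear transformation" claim, not
    # BLIP2's actual pipeline: embed the same inputs with two unrelated
    # frozen featurizers, fit the best least-squares linear map between
    # the two embedding spaces, and look at the residual.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 16))       # paired inputs seen by both

    Wa = rng.standard_normal((16, 64))        # placeholder "encoder A"
    Wb = rng.standard_normal((16, 48))        # placeholder "encoder B"
    A = np.tanh(X @ Wa)                       # (1000, 64) embeddings
    B = np.tanh(X @ Wb)                       # (1000, 48) embeddings

    # Best linear map M with A @ M ~= B, in the least-squares sense.
    M, *_ = np.linalg.lstsq(A, B, rcond=None)
    residual = np.linalg.norm(A @ M - B) / np.linalg.norm(B)
    print(f"relative residual of best linear map: {residual:.3f}")
    # Small residual => "linearly related" holds for these two encoders;
    # large residual => it doesn't. The point is that it's testable, not given.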
>I don't see how the universal approximation theorem is relevant here. The fact that deep models
The universal approximation theorem is precisely not about deep models. Deep means many layers, but in the simplest (and proven) case, a perceptron with a single hidden layer is all you need according to the UAT. Technically it also needs a nonlinear activation function, but you get all sorts of nonlinearities for free downstream anyway in this particular model.
You'd need to increase width (dimensionality) if you make these models shallow.
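To make that concrete, here's a minimal NumPy sketch of exactly that setup: one hidden tanh layer trained with hand-rolled gradient descent on a toy target, sin(3x). Purely illustrative; the target and hyperparameters are arbitrary.

    # One hidden layer + nonlinearity, nothing else: the UAT setting.
    import numpy as np

    rng = np.random.default_rng(1)
    N = 256                                   # hidden width
    x = np.linspace(-2, 2, 400)[:, None]
    y = np.sin(3 * x)                         # toy target function

    W1 = rng.standard_normal((1, N)); b1 = np.zeros(N)
    W2 = rng.standard_normal((N, 1)) / np.sqrt(N)

    lr = 0.05
    for step in range(5000):
        h = np.tanh(x @ W1 + b1)              # the single hidden layer
        pred = h @ W2
        err = pred - y
        # backprop by hand for this tiny network
        gW2 = h.T @ err / len(x)
        gh = err @ W2.T * (1 - h**2)
        gW1 = x.T @ gh / len(x); gb1 = gh.mean(0)
        W2 -= lr * gW2; W1 -= lr * gW1; b1 -= lr * gb1

    print("final MSE:", float((err**2).mean()))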
My point still stands: the fact that models with sufficient capacity can approximate any function does not imply that two models, trained independently of each other on different tasks, will learn to approximate functions that are related to each other by nothing more than a linear transformation.
The UAT states that depth is, at least theoretically, not fundamentally important; it only matters enormously in practice. So adding an intermediate linear layer + some nonlinearity already gets you an error scaling like O(1/N) for width N (in theory), regardless of what you are actually mapping, at least as long as the function is reasonably continuous.
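If you want to eyeball that width scaling yourself, a quick hack (not a proof of any O(1/N) bound) is to fit random tanh features of increasing width N with least squares and watch the error fall:

    # Rough empirical look at error vs. width: fix a random single
    # hidden layer of width N, solve only the output weights by least
    # squares, and print the fit error for a few widths.
    import numpy as np

    rng = np.random.default_rng(2)
    x = np.linspace(-2, 2, 2000)[:, None]
    y = np.sin(3 * x).ravel()                # same toy target as above

    for N in (8, 32, 128, 512):
        W = rng.standard_normal((1, N)); b = rng.uniform(-2, 2, N)
        H = np.tanh(x @ W + b)               # random single hidden layer
        c, *_ = np.linalg.lstsq(H, y, rcond=None)
        mse = ((H @ c - y) ** 2).mean()
        print(f"N={N:4d}  MSE={mse:.2e}")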