Thanks. Your comment about BLIP2 already projecting RGB inputs into (a different) text token space makes sense to me. See also fpgaminer's comment at https://news.ycombinator.com/item?id=35603246 . However, I don't see how the universal approximation theorem is relevant here. The fact that deep models with sufficient capacity can approximate any function does not imply that two deep models, trained independently of each other on different tasks, will learn to approximate functions that are related to each other by nothing more than a linear transformation.
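FWIW, whether two embedding spaces are related by a linear map is something you can just test. Here's a toy sketch (the two "encoders" are made-up stand-ins, not BLIP2's actual models): fit the best least-squares linear map from one embedding space to the other and look at the residual.

    # Toy check of the "related by a linear transformation" claim, not
    # BLIP2's actual pipeline: embed the same inputs with two unrelated
    # frozen featurizers, fit the best least-squares linear map between
    # the two embedding spaces, and look at the residual.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 16))       # paired inputs seen by both

    Wa = rng.standard_normal((16, 64))        # placeholder "encoder A"
    Wb = rng.standard_normal((16, 48))        # placeholder "encoder B"
    A = np.tanh(X @ Wa)                       # (1000, 64) embeddings
    B = np.tanh(X @ Wb)                       # (1000, 48) embeddings

    # Best linear map M with A @ M ~= B, in the least-squares sense.
    M, *_ = np.linalg.lstsq(A, B, rcond=None)
    residual = np.linalg.norm(A @ M - B) / np.linalg.norm(B)
    print(f"relative residual of best linear map: {residual:.3f}")
    # Small residual => "linearly related" holds for these two encoders;
    # large residual => it doesn't. The point is that it's testable, not given.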
>I don't see how the universal approximation theorem is relevant here. The fact that deep models
The universal approximation theorem is precisely not about deep models. Deep means many layers, but in the simplest (and proven) case, a perceptron with a single hidden layer is all you need according to the UAT. Technically it also needs a nonlinear activation function, but you get all sorts of nonlinearities for free downstream anyway in this particular model.
You'd need to increase width (dimensionality) if you make these models shallow.
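To make that concrete, here's a minimal NumPy sketch of exactly that setup: one hidden tanh layer trained with hand-rolled gradient descent on a toy target, sin(3x). Purely illustrative; the target and hyperparameters are arbitrary.

    # One hidden layer + nonlinearity, nothing else: the UAT setting.
    import numpy as np

    rng = np.random.default_rng(1)
    N = 256                                   # hidden width
    x = np.linspace(-2, 2, 400)[:, None]
    y = np.sin(3 * x)                         # toy target function

    W1 = rng.standard_normal((1, N)); b1 = np.zeros(N)
    W2 = rng.standard_normal((N, 1)) / np.sqrt(N)

    lr = 0.05
    for step in range(5000):
        h = np.tanh(x @ W1 + b1)              # the single hidden layer
        pred = h @ W2
        err = pred - y
        # backprop by hand for this tiny network
        gW2 = h.T @ err / len(x)
        gh = err @ W2.T * (1 - h**2)
        gW1 = x.T @ gh / len(x); gb1 = gh.mean(0)
        W2 -= lr * gW2; W1 -= lr * gW1; b1 -= lr * gb1

    print("final MSE:", float((err**2).mean()))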
My point still stands: the fact that models with sufficient capacity can approximate any function does not imply that two models, trained independently of each other on different tasks, will learn to approximate functions that are related to each other by nothing more than a linear transformation.
The UAT states that depth is, at least theoretically, not fundamentally important; it only matters enormously in practice. So adding an intermediate linear layer + some nonlinearity already gets you an error scaling like O(1/N) for width N (in theory), regardless of what you are actually mapping, at least as long as the function is reasonably continuous.
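If you want to eyeball that width scaling yourself, a quick hack (not a proof of any O(1/N) bound) is to fit random tanh features of increasing width N with least squares and watch the error fall:

    # Rough empirical look at error vs. width: fix a random single
    # hidden layer of width N, solve only the output weights by least
    # squares, and print the fit error for a few widths.
    import numpy as np

    rng = np.random.default_rng(2)
    x = np.linspace(-2, 2, 2000)[:, None]
    y = np.sin(3 * x).ravel()                # same toy target as above

    for N in (8, 32, 128, 512):
        W = rng.standard_normal((1, N)); b = rng.uniform(-2, 2, N)
        H = np.tanh(x @ W + b)               # random single hidden layer
        c, *_ = np.linalg.lstsq(H, y, rcond=None)
        mse = ((H @ c - y) ** 2).mean()
        print(f"N={N:4d}  MSE={mse:.2e}")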