I tend to think that the real advantage of big nets is that they're simply compositions of matrix-vector operations (with some component-wise non-linearity tossed in), which lets them scale naturally to massive problems because GPUs are built for exactly that kind of dense linear algebra. Don't get me wrong - the universal approximation theorem is important - but I think it's just the first property any approximator must have. I would be interested to see if the network model could be shown to be remarkable in some other way.
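To make the "compositions of matrix-vector operations" point concrete, here's a minimal sketch of a feed-forward net in NumPy. The layer sizes, ReLU choice, and initialization are my own illustrative assumptions, not anything from the comment above - the point is just that the whole forward pass is nothing but matrix products and an elementwise non-linearity, which is precisely the workload GPUs accelerate.

```python
import numpy as np

# Hypothetical layer sizes, chosen purely for illustration.
rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]

# The entire "network" is just a list of (weight matrix, bias vector) pairs.
params = [(rng.standard_normal((m, n)) * 0.1, np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

def forward(x, params):
    """Forward pass: alternate matrix-vector products with a
    component-wise non-linearity (ReLU here, as an example)."""
    for W, b in params[:-1]:
        x = np.maximum(W @ x + b, 0.0)  # affine map, then elementwise max
    W, b = params[-1]
    return W @ x + b                    # final affine layer, no non-linearity

y = forward(rng.standard_normal(4), params)
print(y.shape)  # (2,)
```

Because every step is dense linear algebra, batching many inputs just turns the matrix-vector products into matrix-matrix products, which is exactly where GPU throughput shines.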