If I'm reading this right, the core of the argument is that since any continuous function can be approximated via a Taylor series expansion, activation functions can be seen, in effect, as polynomial in nature, and since a neuron layer is a linear transformation followed by an activation function, the whole system is polynomial.
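As a rough illustration of what that equivalence looks like in code (a minimal sketch of my own, not the paper's construction; the tanh activation, the degree-7 fit, and the interval [-3, 3] are arbitrary choices):

```python
# Sketch: swap the tanh activation of a tiny 1-hidden-layer net for a
# polynomial fit of tanh and compare outputs. Everything here is arbitrary.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 2)), rng.normal(size=8)   # hidden layer: 2 -> 8
W2, b2 = rng.normal(size=(1, 8)), rng.normal(size=1)   # output layer: 8 -> 1

# Degree-7 least-squares polynomial fit of tanh on [-3, 3].
xs = np.linspace(-3, 3, 200)
poly = np.polynomial.Polynomial.fit(xs, np.tanh(xs), deg=7)

def net(x, act):                      # x has shape (n, 2)
    h = act(x @ W1.T + b1)
    return h @ W2.T + b2

x = rng.normal(size=(5, 2))
print(net(x, np.tanh).ravel())        # original network
print(net(x, poly).ravel())           # polynomial surrogate; close as long as
                                      # the pre-activations stay inside [-3, 3]
```

Composing the fitted polynomial with the affine layers collapses the whole thing into a single multivariate polynomial in the inputs, which is the sense in which the network is "polynomial".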
That's "technically" correct, but it feels like an academic cop-out. Interesting/useful transfer functions tend to be functions that take very large expansions to be approximated with any accuracy.
That practical problem is serious, and it shows up even in this paper: the authors were unable to fit a degree-3 polynomial to a subset (just 26 components) of the MNIST handwritten-digit data due to a "memory issue".
But mathematical theory need not be practical. The relation between NNs and polynomial regression might be a fruitful theoretical observation even if the equivalent polynomial regression is incalculable.
I don't think it is even fruitful. We already know that mappings without poles can be approximated in various ways: Taylor expansions, piecewise-linear functions, Fourier series, and so on. Taylor expansion corresponds to the authors' polynomial fitting; piecewise-linear approximation corresponds to a NN with ReLU activations.
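To make the piecewise-linear half concrete, here's a minimal sketch (my own construction, nothing from the paper): a one-hidden-layer ReLU net with hand-set weights is exactly a piecewise-linear interpolant.

```python
# Sketch: a 1-hidden-layer ReLU "network" with hand-set weights reproduces the
# piecewise-linear interpolant of a target function on [-3, 3].
import numpy as np

f = np.sin
knots = np.linspace(-3, 3, 13)              # breakpoints of the interpolant
y = f(knots)
slopes = np.diff(y) / np.diff(knots)
# Each hidden ReLU adds one kink; its coefficient is the slope change at that knot.
deltas = np.concatenate(([slopes[0]], np.diff(slopes)))

def relu_net(x):                            # one hidden layer of 12 ReLU units
    return y[0] + np.maximum(x[:, None] - knots[:-1], 0.0) @ deltas

xs = np.linspace(-3, 3, 100)
print(np.abs(relu_net(xs) - f(xs)).max())   # small interpolation error on [-3, 3]
```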
Not to mention that as a practical matter, the ability to train a neural network with backpropagation is important to get results that actually converge in a reasonable amount of time. It's not useful to say "but you could just use a polynomial regression" if you can't actually generate the equivalent polynomial regression in the same amount of time that you can train a neural network.
I'm not sure that's actually correct. It's certainly incorrect for a polynomial of degree 1, which is just linear regression. More generally, there's nothing special about ReLU or tanh: the same gradient-descent/backprop machinery works on polynomial regression too.
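A minimal sketch of what I mean (plain NumPy gradient descent on a polynomial feature expansion; the degree, data, and learning rate are arbitrary, and degree 1 would just be linear regression):

```python
# Sketch: gradient descent on polynomial regression, nothing relu/tanh-specific.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 2.0 + 3.0 * x - 1.5 * x**2 + rng.normal(scale=0.05, size=200)

degree = 2
X = np.vander(x, degree + 1, increasing=True)   # feature columns: 1, x, x^2
w = np.zeros(degree + 1)
lr = 0.1
for _ in range(5000):
    grad = 2.0 * X.T @ (X @ w - y) / len(y)     # gradient of mean squared error
    w -= lr * grad
print(w)                                        # roughly [2.0, 3.0, -1.5]
```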
Not to be nitpicky, but it is not the Taylor polynomial (Taylor polynomials do not converge to arbitrary continuous functions). The relevant result is the Weierstrass approximation theorem: polynomial approximation of continuous functions on a closed interval.
f(x) = exp(-1/x^2) for x != 0, with f(0) = 0, has the same Taylor expansion at x = 0 as g(x) = 0: every derivative of f vanishes at 0, so its Taylor series is identically zero even though f is not.
Wouldn't math academics have seen this in literally one second? How was this not pointed out earlier? Just genuinely curious as a completely unaware programmer.
My gut feeling is that yes, this is pretty much self-evident.
However, the interesting part of the paper is that they use that equivalence to propose that properties of polynomial regression are applicable to neural networks, and draw some conclusions from that.
> any continuous function can be approximated via a Taylor series expansion
We can get a good polynomial approximation of any continuous function, but only on a bounded set. Wouldn't that assumption (restricting the activation function's domain) be problematic?
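A quick sketch of the concern (my own example, with arbitrary choices of tanh, degree 7, and the interval [-3, 3]):

```python
# Sketch: a polynomial fit of tanh is good on the fitted interval and useless outside it.
import numpy as np

xs = np.linspace(-3, 3, 200)
poly = np.polynomial.Polynomial.fit(xs, np.tanh(xs), deg=7)

print(np.abs(poly(xs) - np.tanh(xs)).max())   # small error on [-3, 3]
print(poly(10.0), np.tanh(10.0))              # far apart: the fit says nothing off-interval
```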
I think that's a very good point. Yes, you can approximate any given classical NN with a polynomial, but how does the number of terms in the polynomial scale with the network size and the desired accuracy? There might be a very good paper there.
That's "technically" correct, but it feels like an academic cop-out. Interesting/useful transfer functions tend to be functions that take very large expansions to be approximated with any accuracy.