I don't remember the name of the theorem, but you can approximate any continuous multivariable function arbitrarily well with a multi-layer perceptron whose nonlinearity is any non-polynomial function, applied after the linear weights and bias. It has to be non-polynomial because the set of all polynomials is closed under linear combinations, adding constants, and composition, so if the nonlinearity were (say) x^3 you would only ever get polynomials out of the model.
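A toy sketch of the flavor of it, in numpy (the sin target, unit count, and weight scales are just choices I made): freeze random hidden weights, apply tanh, and solve the output layer directly by least squares. The point is only that good weights exist, not that gradient descent would find them.

    import numpy as np

    rng = np.random.default_rng(0)

    # Target: a smooth 1-D function to approximate on [-pi, pi].
    x = np.linspace(-np.pi, np.pi, 500)[:, None]
    y = np.sin(x).ravel()

    # One hidden layer: random weights/biases, tanh nonlinearity
    # (any non-polynomial activation works, per the theorem).
    H = 50  # number of hidden units, arbitrary
    W = rng.normal(scale=2.0, size=(1, H))
    b = rng.normal(scale=2.0, size=H)
    hidden = np.tanh(x @ W + b)  # shape (500, H)

    # Solve for the output layer by least squares instead of
    # gradient descent, just to exhibit that such weights exist.
    coef, *_ = np.linalg.lstsq(hidden, y, rcond=None)
    err = np.max(np.abs(hidden @ coef - y))
    print(f"max abs error with {H} tanh units: {err:.4f}")

More units generally buys a closer fit.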
I'm not sure why that's a problem because polynomial approximations are still useful.
For one, only continuous functions can be represented.
Much more importantly, the theorem doesn't prove that it's possible to learn the necessary weights to approximate any function, just that such weights must exist.
With our current methods, only a subset of all possible NNs are actually trainable, so we can only automate the construction of approximations for certain continuous functions (generally those that are differentiable, but there may be exceptions; I'm not as sure).
If we're talking about approximation, continuous functions can converge to step functions just fine. Take a regular sigmoid and keep raising the weight to see one example. That's a good point about training, though: that theorem doesn't fully explain why NNs work, although it somewhat sounds like it does.
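Quick numpy check of that (the weight values and the small exclusion window around the jump are arbitrary picks of mine):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.linspace(-1, 1, 2001)
    step = (x > 0).astype(float)

    # As the weight w grows, sigmoid(w*x) pinches toward a step at 0.
    # Convergence is pointwise away from the jump, so measure the error
    # outside a small window around x = 0.
    for w in (1, 10, 100, 500):
        err = np.max(np.abs(sigmoid(w * x) - step)[np.abs(x) > 0.01])
        print(f"w={w:4d}  max error outside +/-0.01 of the jump: {err:.2e}")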
There are discontinuous functions which are not steps; one of the more useful ones is tan(x) on the real line. Of course, since tan(x) is piecewise continuous and periodic, it is probably easy to work around in practice.
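E.g., here's the same kind of random-feature sketch as above, fit on a closed interval that stays a margin away from the poles, where tan is continuous and the theorem applies (the margin eps and the unit count are arbitrary choices of mine); outside that interval you'd fold inputs back by periodicity, since tan(x + pi) = tan(x):

    import numpy as np

    rng = np.random.default_rng(0)

    # tan blows up at +/- pi/2, but on a closed interval strictly
    # inside one period it's continuous, so approximation is fine there.
    eps = 0.2  # margin kept away from the poles
    x = np.linspace(-np.pi/2 + eps, np.pi/2 - eps, 500)[:, None]
    y = np.tan(x).ravel()

    H = 100
    W = rng.normal(scale=3.0, size=(1, H))
    b = rng.normal(scale=3.0, size=H)
    hidden = np.tanh(x @ W + b)
    coef, *_ = np.linalg.lstsq(hidden, y, rcond=None)
    print("max abs error:", np.max(np.abs(hidden @ coef - y)))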
The basic approximation theorem you might be thinking of is known as Kolmogorov's Theorem (dude got around). It's an early result, from 1957, about representing a continuous function of several variables using only sums and compositions of single-variable functions.
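For reference, the representation it gives (writing this from memory, so indexing conventions may vary by source): any continuous f on [0,1]^n can be written exactly, not just approximately, as

    f(x_1, \ldots, x_n) = \sum_{q=0}^{2n} \Phi_q\!\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right)

where every \Phi_q and \phi_{q,p} is a continuous function of a single variable, and the only multivariable operation is addition.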
But all the other universality theorems refer back to it and don't have their own names; for example, "Optimal approximation of continuous functions by very deep ReLU networks" by Dmitry Yarotsky [1]. The reference for the original theorem would be "On the Structure of Continuous Functions of Several Variables" by David A. Sprecher [2].