Feedforward NNs are only useful for a very narrow* set of problems (* narrow compared to 'all' the problems out there).
Recurrent networks are needed for stateful operation, i.e. anywhere some kind of memory is needed (any case where the input is spread out over time, or where the order of the data matters). And unfortunately, learning in recurrent nets is still in its very early stages.
They are universal function approximators, which means they can map any set of input values to any set of output values. Of course, doing this sometimes requires rote memorization of every possible input and its output, rather than generalizing the function with a few parameters.
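To make the "rote memorization" point concrete, here's a toy sketch (my own construction, not how real nets are trained): a single hidden layer can memorize a lookup table by dedicating a pair of steep sigmoid units to each stored input, forming a narrow "bump" around it scaled by the stored output.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A lookup table we want the network to "memorize".
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([5.0, -2.0, 7.0, 0.5])

k = 50.0      # steepness: sharper sigmoids -> narrower bumps
half = 0.25   # half-width of each bump

def mlp(x):
    # Two hidden units per stored pair: the difference of two shifted
    # steep sigmoids is ~1 near xi and ~0 elsewhere; scale it by yi.
    out = 0.0
    for xi, yi in zip(xs, ys):
        bump = sigmoid(k * (x - (xi - half))) - sigmoid(k * (x - (xi + half)))
        out += yi * bump
    return out

for xi, yi in zip(xs, ys):
    print(xi, mlp(xi))  # each output lands very close to the stored yi
```

Note this needs two hidden units per memorized point: universal, yes, but the network size grows with the size of the table, which is exactly the opposite of generalizing with a few parameters.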
Adding more layers improves on this and lets you build functions that compose multiple smaller functions. The problem is that the nonlinearities cause the gradients to explode or vanish after a few layers: each layer multiplies the gradient by the local derivative of its nonlinearity, and repeatedly multiplying by numbers smaller (or larger) than 1 drives the product toward zero (or infinity). So the amount of computing power required to train them is huge.
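You can see the vanishing effect with a tiny numerical sketch (a 1-unit-per-layer chain, just to show the multiplication, not a realistic network): sigmoid's derivative is at most 0.25, so the backpropagated gradient shrinks multiplicatively with depth.

```python
import numpy as np

rng = np.random.default_rng(0)
depth = 50

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward through `depth` single-unit sigmoid layers, accumulating
# d(output)/d(input) by the chain rule: one local derivative
# (sigmoid'(z) <= 0.25) and one weight per layer.
x = 1.0
grad = 1.0
for _ in range(depth):
    w = rng.normal()        # typical unit-scale random init
    z = w * x
    a = sigmoid(z)
    local = a * (1.0 - a)   # sigmoid'(z), at most 0.25
    grad *= local * w       # chain rule through this layer
    x = a

print(abs(grad))  # astronomically small after 50 layers
```

With weights much larger than 1 (or other nonlinearities) the same product can blow up instead; either way the early layers get useless training signal.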
Recurrent NNs have the same problem, since they are equivalent to a very deep feedforward network where every layer is a time step and the weights are shared between all the layers.
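The unrolling is easy to show in code. This is a deliberately minimal 1-unit RNN (my own toy, not a real implementation): the loop over time steps is literally a stack of feedforward layers, except the same weights appear at every "layer", so the gradient back through time is a product of the same shared weight times a tanh derivative at every step.

```python
import numpy as np

def tanh_rnn(w_h, w_x, inputs):
    # Unrolled 1-unit RNN: h_t = tanh(w_h * h_{t-1} + w_x * x_t).
    # One "layer" per time step, with w_h and w_x SHARED across layers.
    h = 0.0
    grad = 1.0                       # d h_T / d h_0, via the chain rule
    for x in inputs:
        z = w_h * h + w_x * x
        h = np.tanh(z)
        grad *= w_h * (1.0 - h**2)   # same shared weight at every step
    return h, grad

h, grad = tanh_rnn(w_h=0.5, w_x=1.0, inputs=[1.0] * 30)
print(h, grad)  # after 30 steps the gradient has collapsed toward 0
```

Because the factor is the same weight every step, the product behaves like w_h raised to the sequence length: |w_h| < 1 vanishes, |w_h| > 1 tends to explode. That's why long-range dependencies were so hard to learn with plain RNNs.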
But the invention of Long Short-Term Memory (LSTM) has made training RNNs practical. Basically, as I understand it, the cell state is updated additively instead of being squashed through a nonlinearity at every step, so the gradients along that path don't explode or vanish.
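Here's a sketch of that additive path, reduced to a single scalar cell with the gates held at fixed values (a caricature of an LSTM, just to isolate the mechanism): the cell update is c_t = f_t * c_{t-1} + i_t * g_t, so the gradient back along the cell state is only the product of forget gates, which the network can hold near 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Minimal scalar LSTM cell-state update. The c_{t-1} -> c_t path is
# ADDITIVE: no squashing nonlinearity sits on it, so d c_T / d c_0
# is just the product of the forget gates.
steps = 100
c, grad = 1.0, 1.0
for _ in range(steps):
    f = sigmoid(5.0)   # forget gate held open (~0.99)
    i, g = 0.0, 0.0    # nothing new written, to keep the path visible
    c = f * c + i * g
    grad *= f          # the only factor on the cell-state path
print(c, grad)  # still around 0.5 after 100 steps, not vanished
```

Compare with the plain tanh recurrence above: there the gradient over 100 steps would be ~0. With the gate near 1, the LSTM's error signal survives over long time spans, which is the "constant error" idea behind the design.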