A visual proof that neural nets can compute any function

otabdeveloper1 · on Aug 21, 2015

> Every continuous function in the function space can be represented as a linear combination of basis functions

https://en.wikipedia.org/wiki/Basis_function

This is basic and obvious math. Does slapping the word 'neural' magically make obvious results 1000% more interesting? Why? Because the word 'neural' carries some of that artificial-intelligence-technology-of-the-future cachet?

gizmo686 · on Aug 21, 2015

As the article says:

" If you're a mathematician the argument is not difficult to follow, but it's not so easy for most people. That's a pity, since the underlying reasons for universality are simple and beautiful."

Indeed, as a mathematician, the universality of neural networks is obvious to me from their definition. However, this article is explicitly not aimed at mathematicians, and (as far as I can tell) does a good job of presenting the argument without requiring unnecessary math knowledge, which is something that we tend to be bad at doing.

Furthermore, the universality does not directly follow from the quote you provide, as you would still need to show that the hidden layer neurons form a basis. Infact, the article was about constructing such a basis in the neurons. The only thing that bringing in your quote would serve to do is make the article more confusing by unnecessarily introducing the concept of basis functions.

dicroce · on Aug 21, 2015

As a non mathematician, this is non obvious to me.

Thinking about it a bit (I haven't finished reading the article yet)... Since the size of the hidden layer isn't specified, I suppose you could have a hidden layer node for every possible input... So, of course any function is computable with a neural network. Really the magical thing here is finding the smallest set of nodes that computes the function...

49531 · on Aug 21, 2015

I've been toying with a NN trying to get it to play 2048 based on game data I recorded. I still have about 60% error rate, but I found that with 16 inputs (the tiles on the game), and 4 outputs (directions to move), it works best like a funnel.

I currently have 2 hidden layers, of 12 and 8 and it's the best I've gotten so far.

gizmo686 · on Aug 21, 2015

That is essentially correct. The subtlety is that the hidden layer still has a finite (but arbitrarily large) number of nodes, while there are an infinite number of inputs. The solution to this is that you can keep adding nodes until the space between the nodes is close enough to the function you are modelling.

DougMerritt · on Aug 23, 2015

> This is basic and obvious math.

Yes. But only once you prove that these are in fact space-spanning basis functions.

Does slapping the world 'basis' magically make results obvious and interesting? Why? Because the word 'basis' carries some of that abstract-mathematics cachet?

knappador · on Aug 21, 2015

Right. I had a moment of severe mistrust develop around all the times I've heard the phrase "neural network" mouthed in a pitched voice after I studied the math and realized it was barely cooler than Taylor series approximation. For a moment stacked Boltzmann machines were going to stay in obscurity, but then they became "deep learning" while also solving real problems. If there is one thing I could contribute to the popular programming lexicon, it would be making sure that the phrase "isomorphic" gets used for something other than javascript.

shoo · on Aug 21, 2015

I am glad that the article mentions the Stone-Weierstrass approximation theorem [1], an analogous result for polynomials from 1885.

More recently, there's now "Chebfun" [2], an open-source Matlab library for working with highly accurate 1- and 2- dimensional polynomial approximations to arbitrary continuous functions.

Approximation theory: pretty useful, pretty well-established, under-hyped.

[1] https://en.wikipedia.org/wiki/Stone%E2%80%93Weierstrass_theo... [2] http://www.chebfun.org/

mostly_harmless · on Aug 21, 2015

My favourite way of looking at it is to imagine neurons in a neural net as analogue nand-gates.

The logic within a computer processor is made entirely with nand-gates, and a processor is able to compute any function. Nand-gates have 'functional completeness' and can implement any other gate (AND, OR, NOT, NOR), and any other high level construct with those functions.

Similarly, neurons in a neural net can output a 'NAND' function if set up correctly.

This provides a logical inference of computational completeness of neural-nets

mg · on Aug 21, 2015

I wonder how efficiently it can do that compared to other systems.

For example a short iterative function like this:

    function(complex c)
      complex z=0
      int steps=0
      while (z<someNumber & steps<someNumber)
        z=z^2+c;
        steps++;
      return steps<someNumber;

Can calculate with extremely high accuracy if a point in the complex plane is in the mandelbrot set or not. I would assume that a NN with the same accuracy would be of enormous size. It would probably have way more neurons then there are atoms in the universe.

3v3rt · on Aug 21, 2015

I think one of the caveats was that the function should be relatively smooth and continuous, a mandelbrot set isn't very smooth is it?

fnbr · on Aug 21, 2015

Smoothness isn't necessary to prove the result, just continuity.

You do need smoothness to prove a bound on the rate of convergence of the basis representation, and given that the boundary of the Mandelbrot doesn't have a closed form representation (as far as I'm aware), I think that the convergence of the neural network representation would be extremely slow.

Houshalter · on Aug 21, 2015

The advantage of neural networks is that they can be trained. You can give it a set of inputs and desired outputs, and do gradient descent on it.

Neural networks are essentially like trainable digital circuits. The proof of universality shows that neurons can approximate any kind of logic gate (or any input output mapping, like a lookup table.) A lookup table by itself isn't terribly useful, but you can put them in a series and make arbitrary circuits. And that makes (deep) neural networks strictly more powerful than "shallow" methods.

Retric · on Aug 21, 2015

A very important caveat is the ability to be trained + universality does not mean they can be trained to fit any function to arbitrary precision in finite time.

Houshalter · on Aug 21, 2015

Well of course not. If you could fit any neural network to any function quickly, you would have super powers. But in practice, local optima do seem to stop being an issue in big neural networks.

Also this article shows a method of how to construct a lookup table from a neural network in linear time. So in the worst case you can just memorize the input to output table, quickly. In the best case you can fit a simple elegant model, which fits the data perfectly with very few parameters. Given unlimited amounts of time to search the parameter space. Real world NNs are somewhere between these two.

fnbr · on Aug 21, 2015

I think you're correct.

That's one of the issues with results from theoretical math- the result is true, but it might not be useful.

As another commenter in the thread talks about, we can get a nearly identical result by using polynomials of extremely large degree, but that doesn't work well, because of overfitting. We could come up with a polynomial that also calculates with arbitrarily high accuracy if a point is in the Mandelbrot set or not, but it might end up being of 10100 degree, which is useless.

reagency · on Aug 22, 2015

My gut sense is that 10100 degreed wouldn't even be enough to render a pixel accurate mandlebrot set on your phone's screen. So many twisty edges.

chestervonwinch · on Aug 21, 2015

I tend to think that the real advantage of big nets is that they're simply compositions of matrix-vector operations (with some component-wise non-linearity tossed in), which allows them to scale more naturally to massive problems on the GPU ... Don't get me wrong - the universal approximation theorem is important - but I think this is just the first property an approximator must have. I would be interested to see if the network model could be shown to be remarkable in some other way.

solomatov · on Aug 21, 2015

They can compute any function, but the same can be said about splines, polynomials and many other ways to approximate functions. The problem is with an algorithms which will not lead to overfitting and can really reasonably approximate any function.

dragontamer · on Aug 21, 2015

Wavelets are my favorite approximate function to build up with.

But yeah, the overfitting problem / approximation is the real issue. I do like seeing the hybrid approaches (Genetic Algorithsm searching random Neural Nets with a little bit of backpropigation for some measure)

JumpCrisscross · on Aug 21, 2015

>No matter what the function, there is guaranteed to be a neural network so that for every possible input, x, the value f(x) (or some close approximation) is output from the network

Love NNs, but accuracy is relative. At some resolution of "close approximation," every function "computes" every other function.

throwaway39202 · on Aug 21, 2015

What the author means by "close approximation" is clarified later (under the "Two caveats" header). The point is that you can get an arbitrarily close approximation of any continuous function, just by choosing a sufficiently large number of hidden units. That is, for any epsilon > 0, a neural network can approximate any function within epsilon of the function's exact value at all points, with enough hidden units. The smaller the epsilon, the better the approximation you want, and the more hidden units you need.

TheOtherHobbes · on Aug 21, 2015

What if you're not approximating a continuous function?

chestervonwinch · on Aug 21, 2015

Well, compact continuous functions are dense in L1, so if you're discontinuous but at least integrable, you probably will still get a nice approximation. Take a box function for instance. Strictly speaking this is not continuous, but it's easy to see it can be approximated arbitrarily well by a couple of sigmoids.

leni536 · on Aug 21, 2015

I think it's mostly accurate in L2 norm and not point by point. So every function in L2 can be approximated arbitrarily in L2 norm.

chestervonwinch · on Aug 21, 2015

This is not true. The theorem proves point-wise convergence for a continuous functions on a compact subset of R^n. See theorem 2, page 6: http://deeplearning.cs.cmu.edu/pdfs/Cybenko.pdf

mwexler · on Aug 21, 2015

Note that this is one chapter of a good intro "online book" to neural nets. http://neuralnetworksanddeeplearning.com/index.html is a good place to start.

forkandwait · on Aug 21, 2015

I think an important thing to remember about neural nets is that they are basically a good way to overfit. So, yes, you can approximate any function because you have lots and lots of variables, but you shouldn't fool yourself that you are getting the same information as when you write a deterministic equation with a few variables ("with four parameters I can fit an elephant, with five I can make him wiggle his trunk"... Von Neumann). You can the the baseball but you don't yet know the physics.

I think skilled biological actors use this overfitting to get really good at things without knowing how things actually work.

NN are still cool, though.

effie · on Aug 21, 2015

> I think skilled biological actors use this overfitting to get really good at things without knowing how things actually work.

Perhaps that's the mechanism behind intuition?

GuiA · on Aug 21, 2015

> Consider the problem of naming a piece of music based on a short sample of the piece. That can be thought of as computing a function. Or consider the problem of translating a Chinese text into English. Again, that can be thought of as computing a function. Or consider the problem of taking an mp4 movie file and generating a description of the plot of the movie, and a discussion of the quality of the acting. Again, that can be thought of as a kind of function computation. Universality means that, in principle, neural networks can do all these things and many more.

Eesh, that seems like poor technical written, and potentially misleading. It's far from clear that these are computable functions in the first place; Penrose argues (see notably "Shadows of the Mind") that there are functions which can be computed by the human brain that are undecidable over the Church-Turing thesis.

"Just because we know a neural network exists that can (say) translate Chinese text into English, that doesn't mean we have good techniques for constructing or even recognizing such a network." seems very hand wavy - we don't know that such a neural network exists without first going through a big set of still controversial assumptions.

The rest of the article is good, but I'm not a big fan of that section. It's not a huge part, but I think this could give some students the wrong idea about what is a computable function, and what neural networks can compute.

cLeEOGPw · on Aug 21, 2015

By function I think he meant mapping input points to output points in an abstract plane.

So in that sense a piece of music, or a sentence in one language is a point of input, while name of music or sentence in another language is another point.

Everything is a function as long as there is a way to turn that thing into inputs that correspond to outputs.

GuiA · on Aug 21, 2015

Ok, so by your reasoning let's have a function as a point of input, and whether it halts or not as a point of output.

So now we have a function, I can't wait till "we have good techniques for constructing such a network" that maps those inputs and outputs in an abstract plane :)

cLeEOGPw · on Aug 21, 2015

That is a good example. But you are forgetting that neural networks are approximating the actual functions, so the function you described could be built with some kind of confidence level in the answer. Just like you can have some confidence that certain code will not halt from experience, neural network could also be built to do that. Not all possible functions though, unless you have infinitely large network with infinite computing power.

groar · on Aug 21, 2015

Yeah but it's still misleading as we are talking about approximating continuous functions here, not any function. Those examples are not clearly computable, or even just continuous..

MrManatee · on Aug 21, 2015

I would agree that the examples can be somewhat misleading, but not because of computability. Universal approximation theorem for neural networks doesn't care about computability, but it does require that the domain of the function is finite (or to be more precise, compact).

For example, suppose that the function to be approximated is simply f(x) = x. For any real numbers a < b we can produce an approximation that works well for a ≤ x ≤ b, but it cannot be a good approximation for all real numbers x. (The sigmoid in the hidden layer means that every approximation has to be a bounded function, and a bounded function cannot be a good global approximation of f(x) = x.)

Therefore, the translation example works if we assume that the number of different Chinese texts is finite, but otherwise nothing is guaranteed.

bluesign · on Aug 21, 2015

Isn't this similar to infinite monkey theorem: "Given an infinite length of time, a chimpanzee punching at random on a typewriter would almost surely type out all of Shakespeare's plays."

Given that we can have infinite big neural network and train it in an infinite length of time, it can compute any function.

tfgg · on Aug 21, 2015

I don't think so, since the former is about the occurrence of sequences in stochastic, ergodic processes, and the latter is about approximating deterministic functions.

It's more like given an infinite number of parameters, you have enough degrees of freedom to describe the vector space a function lives in with some set of basis functions, and a neural net supplies that.

jarradhope · on Aug 21, 2015

If you're interested in this you should check out Chris Eliasmith's lab - they have a Neural Engineering Framework and NENGO, which in it's current implementation is like a Python to neural network compiler.

http://www.nengo.ca

yugoja · on Aug 21, 2015

Is anybody reading this book? How good is it for a beginner?

IngoBlechschmid · on Aug 21, 2015

The book is truly excellent. You need some knowledge of (partial) derivatives and matrices, but other than that it doesn't require any more sophisticated mathematics.

We used it to prepare a 10-day course on neuronal networks for high school students. For anyone interested (and capable of reading German), see https://github.com/iblech/mathematik-der-vorhersagen for some notes, Python code, and videos.

p1esk · on Aug 21, 2015

This is probably the best book about neural networks for beginners. It's full of intuitive explanations, not too heavy on math, and it has simple code examples.

bobm_kite9 · on Aug 21, 2015

I love the inline videos and graphs. How was this done?

p1esk · on Aug 21, 2015

I believe you can see all the source code in your browser.

drdeca · on Aug 21, 2015

I don't think this always works if the function in question has infinitely many discontinuities...

w0000t · on Aug 21, 2015

You missed this paragraph, which occurs in the first 10% of the article.

The second caveat is that the class of functions which can be approximated in the way described are the continuous functions. If a function is discontinuous, i.e., makes sudden, sharp jumps, then it won't in general be possible to approximate using a neural net.

p1esk · on Aug 21, 2015

What would be an example of a practical task that needs to be modeled with such a function?

gizmo686 · on Aug 21, 2015

Rounding

cLeEOGPw · on Aug 21, 2015

Rounding is something humans can't do either. how easily can you round a stick that is almost exactly 1.5 meters long? You would be ever sure that you are correct.

And as for rounding in mathematical way, that wouldn't be a problem for neural network, since it is a large distance from 1.4999... and 1.5000... from neural network's perspective, just like it is for you.

p1esk · on Aug 21, 2015

Good example, but in this case it's actually very easy to find weights for the neural network that would do rounding with any desired precision (look for the "stack of towers" method in the article).

gizmo686 · on Aug 21, 2015

Yeah, I think I picked rounding because the article primed me to think about it (the proof presented is essentially based on the ability to create steps, and rounding is my go-to example of a step function). However, any attempt to construct such a network would have a finite amount of steps in it, while the actual step function would have infinite.

In practice, if we needed such a network, we would probably have a restricted domain, so there is only a finite number of discontinuous, and there would be a sufficiently large network that could solve it to an arbitrary precision.

We would not be able to do this with more exotic functions that are densly discontinues, such as the function:

    f(x) = 0  iff x is rational
    f(x) = 1  iff x is irrational

There is no way a neural network can reasonably model that function.

im3w1l · on Aug 21, 2015

Works for all numbers except those ending in .5

And they are quite important to get right.

_efcy · on Aug 21, 2015

I am a neural network, and I can round.

rspeer · on Aug 21, 2015

If you lived in the 1800's, would you have described yourself as an automaton made of gears?

_efcy · on Aug 21, 2015

Perhaps - it's not a bad analogy. But calling myself a neural network is not an analogy - it's literal.

_efcy · on Aug 21, 2015

I find it very interesting that I'm being downvoted based on the difference between my self-concept and whoever is doing the downvoting (they don't think they are a neural network). I'm pretty sure that's not what downvotes are supposed to be used for. Perhaps those who are doing the downvoting would be interested in reading the textbook dedicated to the notion that human beings are neural networks, called Computational Cognitive Neuroscience: https://grey.colorado.edu/CompCogNeuro

p1esk · on Aug 21, 2015

I think you might be getting downvotes because the neural networks discussed in the article have nothing to do with the neural structures in a human brain, and it appears that you're oblivious to this fact.

DanBC · on Aug 21, 2015

Stop talking about how downvotes should be used.

It's tedious, and wrong.

You aren't just getting downvotes from people who disagree with you (if that's why you're getting downvotes); you are failing to gain upvotes from anyone who either agrees with you or thinks your downvotes are unfair.

People downvote to disagree - and there's nothing in any site faqs or guidelines to say this is wrong - but those votes are balanced by people upvoting unfairly downvoted posts.

mistaken · on Aug 21, 2015

Also a single layer is not enough to approximate functions in higher dimensions.

w0000t · on Aug 21, 2015

This is also addressed in the article.

Reading the title is not enough, you have to read the article.

grondilu · on Aug 21, 2015

Even weirdos like the Weirstrass function?

https://en.wikipedia.org/wiki/Weierstrass_function

gizmo686 · on Aug 21, 2015

One of the caveats mentioned in the article is that the neural networks do not necessarily compute a given function precisely, but rather approximate it to an arbitrary accuracy. In the case of the Weierstrass function, this removes any weirdness. This is because, for any accuracy, you could approximate the Weierstrass function as a polynomial. Note that the polynomial example is only to show that the weirdness of the Weierstrass function is not relevant in this context. Approximating it as a polynomial is not necessarily a good approach to approximating it as a neural network.

bra-ket · on Aug 21, 2015

this is really cool

kpil · on Aug 21, 2015

Isn't it so that all continous functions can be approximated with sums of sine waves too?

That is also cool but without the AI connotations...

titanomachy · on Aug 21, 2015

I don't know, sine waves are pretty smart... I think it's only a matter of time before they take over the world.

alvin0 · on Aug 21, 2015

And Not just sine waves. If combining using harmonics and different amplitudes is allowed, any base shape would suffice.